PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

How to use SRA downloaded files in Falcon #428

Closed ls2017 closed 8 years ago

ls2017 commented 8 years ago

I downloaded a dataset from SRA and converted it to *.fastq. A record looks something like this:

@SRR1168519.1 length=302
ATTTTTGTCTGTCCGATTCTGATAGCAGGC
GCATATCAGATGAATCTGATGAGTCAACACTGGTTGGTTCGTTGCTCAGTAGTTATGTTCGTGTGGAGCGTCGTATTGGTATCGAGTCTGATTGTCAGTCATCGATGGTCATTAGTCACGTCCTTCCAGTAGTTCGTATCAACATGCTTCACTATTCTTGTTGTTGTAGATGTTATTCGTATTAGTGTGAGTGTCAGTAGTTACGCGTACAGTATCGGGATTTCGTAGCAGCGCGCGGCGTTGCGGAGTCAAGATTCATGGCTGGACTACGG
+SRR1168519.1 length=302
!"!!!"#$"##!!!"!!"!"#""""#$#"!"!""!!!!!""%"""!"!"#""!#"!!!"!#"!#!!!"!!!"""!!!!"""#!!"#"!"!""!"!!!!""#!!!""!!!"!#!"###"#""!"!!!##!#!#!"!"""!"$$!!"#"$""#"!!"!!#"!!#!!!"!"""!!""%#"$#"$"#"!!!"!!!!!"!!!"!"!"!$#%&%%$"""""""!#"!"!!""##"$!!!!!!!$$!!!!!#!!"!!!!%!"$"!!"""!!!!!!!"!!!!!!$$#"!"!!!"!$$#"!$!!!""!"""

After using Falcon-formatter for format conversion, it does NOT work in Falcon.

It looks like the fasta files require strictly formatted headers carrying the movie name, run start time, SMRT barcode, etc., and should look like this (copied from the ecoli example):

>m140913_050931_42139_c100713652400000001823152404301535_s1_p0/9/1607_26058 RQ=0.831
TGGCATCTCATAAAGCCGCGCGGACGGGCAATAGCACTGGTTCGATTGTCTGGTGTTTATTCCCGGCTGT
TGGGCTGAGTTTGTGATCCCGGTGAACTTCTCGCATGCCGACAGCATCATGATCGGTGCGCTGTCTCCCT
GGCAAATAGAAGTTGTTCAATAACGCGCGCGACTGGCCGTTGGCCTCGGGCGGTTAGCGATGCATCGATG
TTTGCTGGGCTGCTAATTGTGCCCGATAATATGGTTGGTTCGGCACTAAACGACCAGCAAAAAAAAGCGT
GGGAGAACAGATGAAATTATTTACGCGGTAGTTCGTTTCGCCGCTGGCGGATTGTGATTTTGCTGGCTTG
GTCTTACCGTTTTCCTCTACGCGGCCCAATGCTGAGCTGGGTATCTATTCGTTATACGGCTCTGAAGGCT

My question: Can I make up some dummy variables equivalent for ">m140913_050931_42139_c100713652400000001823152404301535_s1_p0/9/1607_26058" to make Falcon work properly?

Or is there another way to dump the *.sra file I downloaded so that it works properly in Falcon?

pb-cdunn commented 8 years ago

My question: Can I make up some dummy variables equivalent for ">m140913_050931_42139_c100713652400000001823152404301535_s1_p0/9/1607_26058" to make Falcon work properly?

Yes. This is a restriction in DAZZ_DB/fasta2DB. The header must match >movie/well/blah plus comments if any. All reads from the same movie should be together in the file. well is an integer. blah is ignored.
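To illustrate the header rule above, here is a minimal Python sketch that rewrites an SRA-style FASTQ stream as FASTA with made-up `>movie/well/start_end` headers. Everything in it is an assumption for illustration: the function names, the dummy movie name, and the choice to give each read its own consecutive well number and a `0_<length>` suffix. It only aims to satisfy the header pattern described in this thread; it cannot recover real movie or well information that the SRA dump no longer carries.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        if not header.strip():  # tolerate blank lines between records
            continue
        if not header.startswith("@"):
            raise ValueError(f"unexpected FASTQ header: {header!r}")
        seq = handle.readline().strip()
        handle.readline()       # '+' separator line
        handle.readline()       # quality line, not needed for FASTA
        yield header[1:].strip(), seq


def fastq_to_dazz_fasta(
    src: TextIO,
    dst: TextIO,
    movie: str = "m000000_000000_00000_dummy_s1_p0",  # hypothetical dummy movie name
    width: int = 70,
) -> int:
    """Write FASTA with >movie/well/start_end headers; return the read count.

    All reads share one movie name, so they are trivially grouped together.
    The well is a running integer and the last field is 0_<read length>.
    """
    count = 0
    for well, (_name, seq) in enumerate(read_fastq(src), start=1):
        dst.write(f">{movie}/{well}/0_{len(seq)}\n")
        for i in range(0, len(seq), width):
            dst.write(seq[i:i + width] + "\n")
        count += 1
    return count


if __name__ == "__main__":
    demo = io.StringIO("@SRR0000000.1 length=8\nACGTACGT\n+\n!!!!!!!!\n")
    out = io.StringIO()
    fastq_to_dazz_fasta(demo, out)
    print(out.getvalue(), end="")
```

For real files, the same function can be called with handles from `open("reads.fastq")` and `open("reads.fasta", "w")` (both file names are placeholders). Note that each FASTQ record is treated as a separate well, which may not match how the reads were actually produced.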

pb-jchin commented 8 years ago

@jingqinwu you will need to download the bax.h5 files and use pls2fasta to convert to proper fasta. SRA's fasta output does not encode proper information for assembly (yet).
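As a rough sketch of the conversion step suggested here, the following Python helper builds one `pls2fasta` command per bax.h5 file. The helper name, the output directory, and the file names are hypothetical. The `-trimByRegion` flag is an assumption about your installed version of `pls2fasta`; check it against the tool's own help output before running anything.

```python
import shlex
from pathlib import PurePosixPath
from typing import List


def pls2fasta_commands(bax_files: List[str], out_dir: str = "fasta") -> List[List[str]]:
    """Return one 'pls2fasta <in.bax.h5> <out.fasta> -trimByRegion' argv per input."""
    commands = []
    for bax in bax_files:
        name = PurePosixPath(bax).name
        # Drop the ".bax.h5" suffix when present; otherwise drop the last extension.
        if name.endswith(".bax.h5"):
            stem = name[: -len(".bax.h5")]
        else:
            stem = PurePosixPath(name).stem
        out = str(PurePosixPath(out_dir) / f"{stem}.fasta")
        # Assumed flag: trims reads using the region table in the bax.h5 file.
        commands.append(["pls2fasta", bax, out, "-trimByRegion"])
    return commands


if __name__ == "__main__":
    # Hypothetical input names; print the commands instead of running them.
    for argv in pls2fasta_commands(["raw/movie.1.bax.h5", "raw/movie.2.bax.h5"]):
        print(shlex.join(argv))
```

Printing the commands first makes it easy to inspect them; each list could then be passed to `subprocess.run(argv, check=True)` once the flags are confirmed.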

rhallPB commented 8 years ago

Depending on how the files were uploaded, they may or may not contain the data needed to format them correctly. Some useful reading: http://microbe.net/2015/01/20/submit-data-to-ncbis-short-read-archive/ and http://seqanswers.com/forums/showthread.php?t=56466 . If the data isn't in the SRA, I would suggest contacting the authors of the study.

ls2017 commented 8 years ago

@pb-jchin @rhallPB Many thanks for your suggestions.

pb-jlandolin commented 8 years ago

See related issue here: https://github.com/pb-jlandolin/PacbioToSRA/issues/2

If they were uploaded by PacBio, they should have links to the original bax.h5 files. You can click on the SRR id, then click on the "Download" tab, and download the original bax.h5 files instead of the .sra files:

[Screenshot, 2016-09-16]