BioInf-Wuerzburg / proovread

PacBio hybrid error correction through iterative short read consensus
MIT License

Results problems #144

Closed q1134269149 closed 4 years ago

q1134269149 commented 5 years ago

Hi, I used proovread to correct my Nanopore Direct RNA sequencing data with illumina sequencing data, but the output .trimmed.f[aq] data is so small. The ori long-reads fastq data is 788M, however, the output .trimmed.fastq only 27M, the output .trimmed.fasta only 14M , and the .untrimmed.fq is 674M. In .untrimmed.fq, the sequence contains a large number of "N" as following: """ @0a0a1a5d-1c34-49eb-8bbc-d2373b5bf9d5_Basecall_1D_template CACAANNNNCAGANACNCNGAANCGAAGCNAAAGANNCGCNGCCANGGACGCCNNGCNGANNCCGNCGNANNGANCCACCAGAGACNNCGCNNAAGGANAGNANNCGNNCGNNAAGCGNNGCCACAAGCCAGANCGCAAAAGAANNNGAAAGNNGCAGNNNGNCNGCGANNAAGNNNGNGGNAAANAGGANACNNGCGNNCNNNGNGAAGCNGNNNCANNNCCGANCAACAACANCANCGNCGGNGCCACNNAGAAAGCAAAGGANACACCANGGNNCGGNNNANGNNNGAGAAGGAAGANNNGAGGNNANCGNGNNNNNAGNGAAAGGCACCGGNNNCAGNNNGNGNNNNGNGNAANGGANCCGNNNNGNNCAANGNNNNNNNGGNNCNNNNGNNNACNGAAAANCGACGNNNNGAANGNAAAGAGCAAACNAANNNAAAGCNGNGGGANNCGCNGANGCCAGCNNGGNCNGNNCACNAANCCCCACACACNCCCACCANNNANNCCNCGCCCANCCAAANCCAGG + !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 
@0a0a5b3d-3ae7-41c8-a9f9-231ca7fc22d4_Basecall_1D_template CNGAAAACCAAAAGAAGAAGAGAAACAACAAGAAGAAGNAANGGTGNCTTCCTCTATGTTCTCCTCCACCGCNGNGGNNACCCTCCCCGGCNCAAGCCACCATGGNCGCNCCATTCACCGGCTTGAAGTCATCCGCTTCTTTNCCCGGTCACCCGCAAGGCCAACAACGACNACTGCCATCACAAGCAACGGAGGAAGAGTTANGCNGCATGAAGGGTGGCCACCAATCGGAAAGAAGAAGNNGAGACTCNANCNCCCNGACCNNCNGGNGACGNNGAANNGGCNAAGGGAAGNNGACNACCNCNCCGCANCAAGNGGANNCCNGNGNNGAANCAGNNGGAGCACGGANNNGCGGNCCNNGAGCACGGAAAACACNCCCGGANACNANGANGGACGANACNGGACAANGNGGAAGCNNCCAACNGNNCGGANGCACCACCGACNCCGCNCNGAGNGNNGAAGGAAGNNGAAGAAAACGGCAAGAAGGAGNACCNGGCGCCNNNANGGGANCANCGGANNNCCGACAACACCCGNCAAGNCCAANGCANCAGNNNNCANNGCCCNAAGCCCCAGCNNCNCCNNGGGNCNAANCCCCNNCNGGAANANNCAGCGNNGANNANNANNCNGGAACACCANNNCNANGNGGNCAANGCAAANNNAAGAAANNANNNGCCGANCGCAGNNGAGGAACNANNGNNNGAAAGNGAAAANGNNANNCCNANCAGNNNCNAANNANAGNNANCANNCAAANCCCANCCCANNNCANCCANCCCANCCCNANCCAAAANCCCNCAGG + !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 
@0a0a6f37-23c3-466c-baef-d43fff52a4de_Basecall_1D_template NGGAGAGNAAAGNAGAACAANGCCGAANCCNNANGNNCCGCANANNGNNGCCAAACGGGCCNNNANAGCACGCNCCACAACGGACNNAAGNCCNCGCNGCACCNNCCCCAGCCACCCGCAGGCNAACCNACAGGCNACNNCCAACACAAGCAACGGCGGAGNNANNGCANANGCAGGNGNGGCACCACGNGANNGGAAAAGNAGAGNNNGAGACNCNNNNCCNNCCNGACCNNNCGANNCCNAANNGGGCNAAGGAAGNNGNANCACCCNCANCCGCCAACAAGNNGGANNCCNNGNGNNGAANNCGAGNGGGAAAANACGGANNNANNGNACCGNGAGCACGGNAACNNNCCCGGANANCGNGANGGACGGNACNGGACAANGNGAAGCNNCCCCNGNNCGGNNGCACCGAACNCCGCNCNAAGNGNNGAAGGAAGNAGAAGAGNGCAAGAAGGAGNACCCCAANGCCNNNANGGAGNACANGGANNNCGACAACACCCGNCAAGNCCAGNGCANCAGNNNANNGCCNCACAAGCCACCAAGCNNCGGNNAANNNCCCNNCNGCNNNGNGNAAACCNCAANANNNNANCCCCCCANNGANNNNANCCCNNGNNNNNCNGCNNNNNCNNNNGAGGNNNNAANCNCCGGACNNAACGNNNGNNNNCCACGGNNGCGAGNNNANNNANCGGANNCNCANNGNNNAGCGCAANAANANGNNGNNNAANCCCCAACCCCNACNCCANCCCACCANNNANCCNNCANNACANCCANCCCACCAANNNACACNCCAANCCANN + !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! """ I used illumina short-read data is 150bp and Illumina reads 1 and 2 were merged into fragments using FLASH at least 30X.

I don't know what the problem is. Can you give me some advice? Thanks
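For reference, the share of "N" bases in a file like the .untrimmed.fq can be quantified with a short script. This is only a rough sketch, assuming standard four-line FASTQ records; the function names `read_fastq` and `summarize_n` are made up for illustration and are not part of proovread.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # skip blank lines between records
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().strip()
        yield header.strip().lstrip("@"), seq, qual


def summarize_n(handle: TextIO) -> Tuple[List[Tuple[str, float]], float]:
    """Return per-read N fractions and the overall N fraction across all reads."""
    per_read = []
    total_bases = 0
    total_n = 0
    for name, seq, _qual in read_fastq(handle):
        n_count = seq.upper().count("N")
        per_read.append((name, n_count / len(seq) if seq else 0.0))
        total_bases += len(seq)
        total_n += n_count
    overall = total_n / total_bases if total_bases else 0.0
    return per_read, overall


if __name__ == "__main__":
    # Tiny in-memory example; replace with open("your.untrimmed.fq") for real data.
    demo = io.StringIO("@r1\nACGN\n+\n!!!!\n@r2\nNNNN\n+\n!!!!\n")
    reads, overall = summarize_n(demo)
    for name, frac in reads:
        print(f"{name}\t{frac:.3f}")
    print(f"overall\t{overall:.3f}")
```

A high overall N fraction, together with the all-"!" quality strings, would point to reads that received little or no short-read coverage.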

thackl commented 5 years ago

Can you send me the log from the proovread run? That would help me understand where things might have gone wrong.

q1134269149 commented 4 years ago

This is the log: col0_1qf.log. Thanks.

thackl commented 4 years ago

Have a look at https://github.com/BioInf-Wuerzburg/proovread#log-and-statistics, which explains how to read the proovread log. In your case, there seems to be an issue with mapping the short reads to the Nanopore reads:

[Wed Sep 11 10:27:28 2019] Running mode: mr
[Wed Sep 11 10:31:37 2019] Running task bwa-mr-1
[Thu Sep 12 13:22:14 2019] Masked : 26.8%
[Thu Sep 12 13:22:14 2019] Running task bwa-mr-2
[Fri Sep 13 11:18:02 2019] Masked : 27.9%
[Fri Sep 13 11:18:57 2019] Running task bwa-mr-finish
[Sat Sep 14 01:55:09 2019] Masked : 30.0%

The masked percentages should be on the order of 80% in good runs. proovread is optimized for PacBio, not for Nanopore. The error profiles of the two technologies differ, and here the mapping does not seem to be sensitive enough for your data. Unfortunately, I don't think there is a good way to increase proovread's sensitivity enough to get good results for your data. You might want to look at other programs that deal specifically with Nanopore data.
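If you want to check this quickly on other runs, the masked percentages can be pulled out of the log with a few lines of Python. This is a minimal sketch that only assumes the "Running task ..." and "Masked : ...%" line format shown in the excerpt; the function names and the 80% cutoff are illustrative (the cutoff is just the rough guide mentioned here, not a proovread setting).

```python
import re
from typing import Iterable, List, Optional, Tuple

# Matches either a task line or a masked-percentage line from the log excerpt.
LOG_RE = re.compile(r"Running task (\S+)|Masked\s*:\s*([\d.]+)%")


def masked_by_task(log_lines: Iterable[str]) -> List[Tuple[Optional[str], float]]:
    """Pair each 'Masked : X%' value with the most recent 'Running task' name."""
    results = []
    task = None
    for line in log_lines:
        match = LOG_RE.search(line)
        if not match:
            continue
        if match.group(1):
            task = match.group(1)
        else:
            results.append((task, float(match.group(2))))
    return results


def low_masking(results, threshold: float = 80.0):
    """Return the tasks whose masked percentage is below the (assumed) threshold."""
    return [(task, pct) for task, pct in results if pct < threshold]


if __name__ == "__main__":
    log = """\
[Wed Sep 11 10:27:28 2019] Running mode: mr
[Wed Sep 11 10:31:37 2019] Running task bwa-mr-1
[Thu Sep 12 13:22:14 2019] Masked : 26.8%
[Thu Sep 12 13:22:14 2019] Running task bwa-mr-2
[Fri Sep 13 11:18:02 2019] Masked : 27.9%
[Fri Sep 13 11:18:57 2019] Running task bwa-mr-finish
[Sat Sep 14 01:55:09 2019] Masked : 30.0%
"""
    for task, pct in low_masking(masked_by_task(log.splitlines())):
        print(f"{task}: only {pct}% masked")
```

For a real run, pass the lines of the log file (for example `open("col0_1qf.log")`) instead of the inline string.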