bcgsc / ntEdit

✏️ Genome assembly polishing & SNV detection
GNU General Public License v3.0

nthit error #3

Closed madhubioinfo closed 5 years ago

madhubioinfo commented 5 years ago

Hi, I am trying to run nthits in order to run ntEdit. My command is:

../ntHits/nthits -c 2 --outbloom -p solidBF -k 25 -t 48 @reads.in

The reads.in file looks like this:

/home/C2_sequencing/C2_minionreads/C2_illumina_reads/SRR2727640_1.fastq
/home/C2_sequencing/C2_minionreads/C2_illumina_reads/SRR2727640_2.fastq.gz
/home/C2_sequencing/C2_minionreads/C2_illumina_reads/SRR2727641_1.fastq.gz
/home/C2_sequencing/C2_minionreads/C2_illumina_reads/SRR2727641_2.fastq.gz
/home/C2_sequencing/C2_minionreads/C2_illumina_reads/SRR2727643_1.fastq.gz
/home/C2_sequencing/C2_minionreads/C2_illumina_reads/SRR2727643_2.fastq.gz
/home/C2_sequencing/C2_mp_illumina/SRR2727626_1.fastq
/home/C2_sequencing/C2_mp_illumina/SRR2727626_2.fastq

and I am getting the following error:

Reapeat profile estimated using ntCard in (sec): 25.5364
Errors k-mer coverage: 1
Median k-mer coverage: 1
Repeat k-mer coverage: 2
Approximate# of distinct k-mers: 160272833
Approximate# of solid k-mers: 84845428
Error in reading file: /home/C2_sequencing/C2_minionreads/C2_illumina_reads/SRR2727640_1.fastq

warrenlr commented 5 years ago

@mohamadi, can fastqs be mixed compressed/uncompressed? @madhubioinfo, have you checked that your files are properly fastq formatted?

madhubioinfo commented 5 years ago

I checked the manual; mixing .gz and uncompressed FASTQ files is supported. The files are properly formatted: I used them with all my other assemblers and they worked well.

mohamadi commented 5 years ago

@warrenlr yes they can be mixed. @madhubioinfo could you please send me the first line in the file below?

/home/C2_sequencing/C2_minionreads/C2_illumina_reads/SRR2727640_1.fastq

The error occurs while reading and processing the first header line in SRR2727640_1.fastq. The ntCard statistics also suggest that something is wrong with the input files.

mohamadi commented 5 years ago

@madhubioinfo I tested and ran nthits again with your parameters and a mixed set of files like yours in cel.in:

$ cat cel.in 
ERR294494_1.part2.fastq
ERR294494_1.part3.fastq.gz
ERR294494_1.part5.fastq
ERR294494_2.part1.fastq
ERR294494_2.part4.fastq.gz
ERR294494_2.part6.fastq

$ ~/ntHits/nthits -c 2 --outbloom -p solidBF -k 25 -t 48 @cel.in 
Reapeat profile estimated using ntCard in (sec): 22.4873
Errors k-mer coverage: 6
Median k-mer coverage: 29
Repeat k-mer coverage: 2
Approximate# of distinct k-mers: 237569323
Approximate# of solid k-mers: 99340821
Total time for computing repeat content in (sec): 176.3761

madhubioinfo commented 5 years ago

@SRR2727640.1 1 length=125
TTAATTATTTTAATTAATAACATCAATGAATATAATAAAACATAATAAACACAAAATAATAATAGTACCTTTTTTTTTTAGATTTACTAAAATAAATAATTAATTTATATTTTTTTTTTTTATTA
+SRR2727640.1 1 length=125
B/<BBF/<BFFFFF/FFFBBF<F/F/<<<F/<BFFBFBFFFFF<F/BFF/B///B<FBB/B<B</<<B<BFFFF<FFFB##############################################
@SRR2727640.2 2 length=125
AGTGGTCAACCACCCTTTGCTACTGAAGAGTACGATATTGATTTAGCTTTAGAAATTTCACAAGGTCGTAGAGAAGAACCCATTCTTAATATGCCTGAAAGTTATATTAAAATTTATACTGTTAC
+SRR2727640.2 2 length=125
</<<<FFF//FF//BFFFFBBFFF//F/<<F///7<7</B///<<B/<FFFFF/<<FFBB/F/F<FBBF</BFFFFFBFFB//FFFBBBFF////B/<F/<<FFFFF//7<FF/F/F/7BF####

mohamadi commented 5 years ago

What is the system & environment setup?

madhubioinfo commented 5 years ago

It's Ubuntu 18.04.

mohamadi commented 5 years ago

Can you share the input data?

madhubioinfo commented 5 years ago

Hi, I don't know how to share the data, but it's available online. Here is the download link: https://www.ncbi.nlm.nih.gov/sra/SRX1355387[accn]

mohamadi commented 5 years ago

I'll download and check.

warrenlr commented 5 years ago

@madhubioinfo, what's the estimated genome size of the fungus?

madhubioinfo commented 5 years ago

130 Mb

warrenlr commented 5 years ago

We are still downloading the data here, and I suspect there is an insane coverage* of the genome (much more than you need for correction). That said, did you try various combinations of files with nthits, to see if only the first file in your reads.in is giving you the error? You could try testing this on your end. You should also do a test by running nthits like this:

../ntHits/nthits -c 2 --outbloom -p solidBF -k 25 -t 48 SRR2727641_1.fastq.gz SRR2727641_2.fastq.gz

Do you still see the same format error?
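
One way to try the file combinations systematically is to generate one nthits command per input file. The sketch below is only an illustration: it reuses the flags from the command above, while the function names `build_test_commands` and `run_and_report` and the `test_N` output prefixes are hypothetical. Whether nthits returns a non-zero exit code on a read error is not confirmed here, so the helper also searches the output for the "Error in reading file" message.

```python
import shlex
import subprocess


def build_test_commands(nthits_path, files):
    """Build one nthits command (as an argv list) per input file.

    The flags mirror the command used in this thread; the per-test
    output prefix (test_1, test_2, ...) is an assumption made so that
    the runs do not overwrite each other's output.
    """
    commands = []
    for index, path in enumerate(files, start=1):
        commands.append([
            nthits_path, "-c", "2", "--outbloom",
            "-p", f"test_{index}", "-k", "25", "-t", "48", path,
        ])
    return commands


def run_and_report(command):
    """Run one command and report whether the read error shows up."""
    result = subprocess.run(command, capture_output=True, text=True)
    output = result.stdout + result.stderr
    failed = "Error in reading file" in output or result.returncode != 0
    return failed, output


# Print the commands for inspection; nothing is executed here.
example_files = ["SRR2727640_1.fastq", "SRR2727640_2.fastq.gz"]
for command in build_test_commands("../ntHits/nthits", example_files):
    print(shlex.join(command))
```

Passing each generated command to `run_and_report` would then show which file, if any, triggers the error on its own.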

*Back-of-the-envelope calculations give roughly 150X-200X coverage from all your files. If you use only a pair of files, you still have 40X, which is more than enough.
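
For reference, the back-of-the-envelope estimate is just total sequenced bases divided by genome size. The sketch below uses the 130 Mb genome size and 125 bp read length mentioned in this thread; the function names are hypothetical, and no actual read counts from the dataset are assumed.

```python
import math


def estimate_coverage(num_reads, read_length, genome_size):
    """Approximate fold coverage: total bases divided by genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Number of reads required to reach a target fold coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


GENOME_SIZE = 130_000_000  # 130 Mb, as stated in this thread
READ_LENGTH = 125          # from the length=125 FASTQ headers

# How many 125 bp reads correspond to 40X of a 130 Mb genome?
print(reads_needed(40, READ_LENGTH, GENOME_SIZE))
```

Counting the reads in each FASTQ file (the number of lines divided by four) and passing that to `estimate_coverage` gives the per-file contribution.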

warrenlr commented 5 years ago

Also, it just occurred to me: could you verify that your SRA download was complete (check the MD5 sum, or run tail on the files)? I wonder whether a truncated record, producing a read shorter than k, might throw an error.
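
A quick way to look for a truncated record is to scan the file four lines at a time. This is a minimal sketch, assuming strict four-line FASTQ records (no wrapped sequences); the function names are hypothetical, and whether a read shorter than k actually makes nthits fail is only the hypothesis raised above, not something confirmed.

```python
import gzip


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def scan_fastq(lines, k):
    """Return (record_count, problems) for an iterable of FASTQ lines."""
    problems = []
    records = 0
    batch = []
    for line in lines:
        batch.append(line.rstrip("\r\n"))
        if len(batch) < 4:
            continue
        records += 1
        header, seq, plus, qual = batch
        batch = []
        if not header.startswith("@"):
            problems.append(f"record {records}: header does not start with '@'")
        if not plus.startswith("+"):
            problems.append(f"record {records}: separator line does not start with '+'")
        if len(seq) != len(qual):
            problems.append(f"record {records}: sequence and quality lengths differ")
        if len(seq) < k:
            problems.append(f"record {records}: read length {len(seq)} is shorter than k={k}")
    # Leftover non-empty lines mean the last record is incomplete.
    if any(batch):
        problems.append(f"truncated final record ({len(batch)} of 4 lines)")
    return records, problems
```

For a real file, something like `with open_fastq("SRR2727640_1.fastq") as handle: print(scan_fastq(handle, k=25))` would report the record count and any problems found.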

mohamadi commented 5 years ago

@madhubioinfo @warrenlr nthits just finished successfully on the dataset you sent:

$ cat reads.in 
SRR2727640.fastq
SRR2727641.fastq
SRR2727643.fastq

$ ~/ntHits/nthits -c 2 --outbloom -p solidBF -k 25 -t 48 @reads.in
Reapeat profile estimated using ntCard in (sec): 223.3665
Errors k-mer coverage: 22
Median k-mer coverage: 61
Repeat k-mer coverage: 2
Approximate# of distinct k-mers: 1855503243
Approximate# of solid k-mers: 134025716
Total time for computing repeat content in (sec): 506.7850

madhubioinfo commented 5 years ago

I ran the command without the reads.in file, providing all the files on the command line instead, and it worked. Thank you very much for helping. I appreciate the fast response and the fast solution.

warrenlr commented 5 years ago

Thank you @mohamadi! @madhubioinfo, this is great, glad you got it working!

For posterity, the original error might have been due to an extra character or space in the file list, or to a formatting issue (e.g. line spacing).
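
To check a file list for this kind of problem, a small script can flag stray whitespace, blank lines and Windows-style line endings. This is a minimal sketch with a hypothetical function name; how nthits itself parses the `@` file list is not confirmed here, so the checks simply cover the suspects named above.

```python
def check_file_list(lines):
    """Return (line_number, problem) pairs for suspicious list entries."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        entry = raw.rstrip("\n")
        if entry.endswith("\r"):
            problems.append((number, "Windows-style line ending (CR)"))
            entry = entry.rstrip("\r")
        if entry.strip() == "":
            problems.append((number, "blank line"))
            continue
        if entry != entry.strip():
            problems.append((number, "leading or trailing whitespace"))
        if any(ch.isspace() for ch in entry.strip()):
            problems.append((number, "whitespace inside the path"))
        if any(not ch.isprintable() and not ch.isspace() for ch in entry):
            problems.append((number, "non-printable character"))
    return problems


# Small in-memory demonstration with made-up paths.
example = ["/data/a_1.fastq\n", "/data/a_2.fastq.gz \n", "\n"]
for number, problem in check_file_list(example):
    print(f"line {number}: {problem}")
```

When reading a real list, opening it with `open("reads.in", newline="")` keeps carriage returns visible to the check instead of letting Python translate them away.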