chanzuckerberg / shasta

[MOVED] Moved to paoloshasta/shasta. De novo assembly from Oxford Nanopore reads

Error reading from ~/path-to-fastq/*.fastq near offset 0 #219

Closed agnibhat closed 3 years ago

agnibhat commented 3 years ago

Hi,

I am trying to assemble my nanopore sequencing reads using Shasta, and I am encountering an error that essentially says what the subject above states. I am attaching a log file for your reference. I am rather lost trying to figure out where the error is coming from; any help is much appreciated.

I am using a MacBook for now, with very basic resources. The genome being assembled is a fairly small one, at ~24 Mb.

P.S. I am a newbie at bioinformatics analysis. shasta.log

paoloczi commented 3 years ago

Please post the output of the following command:

ls -l /Users/akv4001/Desktop/Seqdata/Shasta/nf54-816-h6merge.fastq

agnibhat commented 3 years ago

-rw-r--r--@ 1 akv4001 staff 2791787283 Dec 18 17:50 /Users/akv4001/Desktop/Seqdata/Shasta/nf54-816-h6merge.fastq

paoloczi commented 3 years ago

It is possible that the macOS version of Shasta has a 2 GB limit on the size of a file it can read. I suggest converting the fastq file to fasta, which will reduce its size by approximately a factor of two, or splitting it into two files, each less than 2 GB.
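
For example, a sketch of both approaches (file names are placeholders; it assumes seqtk is installed for the conversion, and for splitting, the line count must be a multiple of 4 so that fastq records are not broken across files — adjust it so each part stays under 2 GB):

seqtk seq -a reads.fastq > reads.fasta    # convert fastq to fasta, roughly halving the size
split -l 20000000 reads.fastq part-       # or split into chunks of 20 million lines each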

The Linux version has no such limitation and is able to process files that are hundreds of GB in size.

How much memory does your Mac have? It is likely that you will need at least 12 to 16 GB to run this assembly (this is a separate issue from the problem you are seeing now).

agnibhat commented 3 years ago

Thanks! I will try it. My Mac has 8 GB of memory, so that might be a problem. I will try to run it on a Linux machine and see if it works. How much time might it take to assemble a ~24 Mb genome with 16 cores and 64 GB of RAM? I might have access to such a system in the near future.

paoloczi commented 3 years ago

That assembly should take just a few minutes on a machine like the one you described.

paoloczi commented 3 years ago

For best results, and assuming you have recent nanopore data, make sure to use config file shasta/conf/Nanopore-Sep2020.conf. Specify the configuration file with command line option --config. You can download the file from the GitHub repository, or get it from the tar file for the current release.
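
A sketch of the invocation (file names here are placeholders, and the binary name depends on the release you downloaded):

./shasta --input reads.fasta --config Nanopore-Sep2020.conf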

paoloczi commented 3 years ago

I am closing this due to lack of discussion, but feel free to reopen it or create a new issue if more information emerges.

ndierckx commented 3 years ago

I have the same issue on Linux with 500 GB of RAM available. The reads were merged into one fastq file of 68 GB...

paoloczi commented 3 years ago

Please provide the following information:

- The version of Linux you are running.
- The output of stat -f for the directory that contains your input fastq file (see the example below).
- Whether you are using option --Reads.noCache (you might be getting it through a configuration file). If you are, try removing it and see if the problem still occurs.

If you provide the above information, I may be able to give suggestions.
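
For example (a sketch; the path is a placeholder for wherever your fastq file lives, and -c %T is GNU stat syntax that prints just the filesystem type):

stat -f /path/to/reads.fastq          # full filesystem information
stat -f -c %T /path/to/reads.fastq    # just the filesystem type name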

ndierckx commented 3 years ago

Thanks for the quick reply

This is the version: "CentOS Linux 7 (Core)"

Got this from stat -f:

ID: ef0009600000002 Namelen: 255 Type: gpfs
Block size: 16777216 Fundamental block size: 16777216
Blocks: Total: 76021760 Free: 23653323 Available: 23653323
Inodes: Total: 402653184 Free: 217161914

I am running it now after splitting the file into <2 GB files, and it seems to work. I will try removing --Reads.noCache in the next run.

paoloczi commented 3 years ago

There is a known issue #202 when using --Reads.noCache on the gpfs filesystem. Splitting the file will not help. It should work if you remove --Reads.noCache, although this might cause some reduction in assembly performance, depending on your machine configuration.
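
If the option is coming from a configuration file rather than the command line, edit the conf file to drop it. A sketch, assuming the INI-style layout that Shasta configuration files use (the exact line in your file may differ):

[Reads]
# noCache = True    (commented out to disable it)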

Hopefully this will be fixed in the next release. On read failure, we should automatically turn off --Reads.noCache and retry.

ndierckx commented 3 years ago

Ok, I will try that later, but it is still running after splitting into small files. Currently at "computing marker graph vertices".

paoloczi commented 3 years ago

Oh cool. We have seen strange things happen with gpfs, so this adds to the list. For the future, if your data are on gpfs, I suggest just taking out --Reads.noCache, without worrying about splitting the file.

ndierckx commented 3 years ago

Ok will do for the next run, thanks for the help and the great tool!