AnantharamanLab / VIBRANT

Virus Identification By iteRative ANnoTation
GNU General Public License v3.0

Issue in running .fastq files #37

Closed BikramDroid closed 3 years ago

BikramDroid commented 3 years ago

VIBRANT runs .fasta files perfectly, but is there a way to make it run .fastq files as well?

Also, is there an easy way to convert these files from fasta to fastq? The fastq file I have is very big.

KrisKieft commented 3 years ago

Hi,

Fastq files contain the reads that are used to assemble scaffolds, and the scaffolds are stored in Fasta format. VIBRANT was built to identify viruses from assembled scaffolds that are at least 1 kb each. Unless you have very long PacBio/Nanopore reads, Fastq files will not be useful for analysis. Also, Fastq files are likely to be larger than the assembled Fasta file. I may need more clarification in order to help.

Kris
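For the long-read case mentioned above, a conversion from Fastq to Fasta could look roughly like the following. This is a minimal sketch, not part of VIBRANT: it assumes plain 4-line Fastq records (no wrapped sequences, no compression), and the function names and the `convert_file` helper are hypothetical. The 1000-base cutoff mirrors the 1 kb minimum stated in the reply.

```python
from typing import Iterable, Iterator, Tuple

MIN_LENGTH = 1000  # 1 kb minimum scaffold length mentioned in the thread


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs, assuming 4 lines per Fastq record."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"Expected '@' header, got: {header[:30]!r}")
        seq = next(it, "").strip()
        next(it, None)  # '+' separator line
        next(it, None)  # quality line, which Fasta does not store
        yield header[1:], seq


def fastq_to_fasta(lines: Iterable[str], min_length: int = MIN_LENGTH) -> Iterator[str]:
    """Yield Fasta-formatted records for reads of at least min_length bases."""
    for name, seq in read_fastq(lines):
        if len(seq) >= min_length:
            yield f">{name}\n{seq}\n"


def convert_file(fastq_path: str, fasta_path: str, min_length: int = MIN_LENGTH) -> None:
    """Stream a Fastq file to a Fasta file without loading it all into memory."""
    with open(fastq_path) as src, open(fasta_path, "w") as dst:
        dst.writelines(fastq_to_fasta(src, min_length))
```

Calling `convert_file("reads.fastq", "reads.fasta")` (both file names are placeholders) streams the input line by line, so file size should not be a memory problem. Short reads are dropped rather than converted, since reads below 1 kb would not be useful input according to the reply above.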

BikramDroid commented 3 years ago

Screenshot 2021-02-05 at 09 37 40

Hi, I got the .fasta file, but it's really big (hundreds of MB). The program ran for quite a few hours and then stopped without any error on the terminal. Above you can see the log, which reports an error. Many files were generated, but not all of them: the figures, which I'm most concerned about, are missing. Can you look into this, or do you need more log files?

KrisKieft commented 3 years ago

Hi,

For large fasta files I'd suggest using multiple threads or the runtime will be quite long. As for this error, it's likely due to a bug in an earlier version. Are you using VIBRANT v1.2.1? If not then please update and this issue should resolve.

Kris

BikramDroid commented 3 years ago

Hi,

Thanks for the update. Yes, I'm on the latest version, VIBRANT v1.2.1. It was released in March last year, and I only started using VIBRANT last month, so that is the version I have.

As for using multiple threads to run this large file, can you explain a bit more? Sorry, I'm new to this.

So it's clearly not a version problem, and I just need to use multithreading, which should solve the issue and cut the running time.

Looking forward to your solution.

KrisKieft commented 3 years ago

Hi,

Generally, multi-threading means that the job will be split across multiple processes, rather than running as just one. What type of computer are you running this on? On personal laptops you usually can't easily go above 2 or 3 threads, and in those cases VIBRANT is difficult to run on large datasets. On computing clusters or servers you'll have to find out how many threads you are able to use. VIBRANT has a -t flag.
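As a rough way to pick a value for -t, the thread count can be capped at what the machine reports. This is only a sketch under assumptions: the helper names are hypothetical, the `VIBRANT_run.py` script name and the `-i` input flag are not mentioned in this thread and should be checked against the README, and `os.cpu_count()` reports the CPUs of the whole machine, which on a shared cluster may be more than your job is allowed to use.

```python
import os
from typing import List, Optional


def choose_threads(requested: Optional[int] = None) -> int:
    """Return a thread count between 1 and the number of CPUs reported."""
    available = os.cpu_count() or 1  # cpu_count() can return None
    if requested is None:
        return available
    return max(1, min(requested, available))


def build_vibrant_command(fasta_path: str, threads: int) -> List[str]:
    """Build (but do not run) a command line using the -t flag."""
    # Assumption: script name and -i flag; only -t comes from this thread.
    return ["VIBRANT_run.py", "-i", fasta_path, "-t", str(threads)]


if __name__ == "__main__":
    # "scaffolds.fasta" is a placeholder file name.
    print(" ".join(build_vibrant_command("scaffolds.fasta", choose_threads(4))))
```

The command is only printed here, so it can be reviewed before being run in a terminal or job script.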

I was looking back through previous issues and saw this same one. I don't think I was able to specifically figure it out before. It seemed to be best resolved by downloading through GitHub rather than conda (I'm not sure which you did). The other idea I had was to use the -folder flag to specify an output directory, in case certain files weren't being generated due to permissions issues. I see in your error that the paths seem to be a little odd. I'd suggest specifying an output folder using -folder.
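To rule out the permissions problem before a long run, one option is to test that a file can actually be created in the intended output folder. The sketch below is an illustration only: `is_writable_dir` and `folder_args` are hypothetical helpers, and creating the folder when it is missing is an assumption about what is convenient, not documented VIBRANT behavior.

```python
import os
import tempfile
from typing import List


def is_writable_dir(path: str) -> bool:
    """Return True if a file can be created inside path (created if missing)."""
    try:
        os.makedirs(path, exist_ok=True)
        # Creating a real temporary file is a more reliable check than
        # inspecting permission bits; the file is removed automatically.
        with tempfile.TemporaryFile(dir=path):
            pass
        return True
    except OSError:
        return False


def folder_args(path: str) -> List[str]:
    """Return the -folder arguments, or raise if the folder is unusable."""
    if not is_writable_dir(path):
        raise PermissionError(f"Cannot write to output folder: {path}")
    return ["-folder", path]
```

The returned list can be appended to the rest of the command line, so a bad output path fails immediately instead of after hours of runtime.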

BikramDroid commented 3 years ago

Hi again,

Sorry for the late reply; I was trying it on multiple files. And yes, with the -t flag the execution time is cut by a lot. Even big files are processed in a few minutes, thanks.

The output is OK, except for the figures section. There is no error in the main run log file.

As you can see below, there is a warning about the scikit-learn module. Could that be the culprit here?

Screenshot 2021-02-08 at 15 14 09

There is also an issue with loading some scikit-learn modules in the log_annotation file.

Screenshot 2021-02-08 at 15 16 39

The only figure file that was generated:

Screenshot 2021-02-08 at 15 18 13

KrisKieft commented 3 years ago

Hi,

Yes, that is an important issue. Please install sklearn v0.21.3. There are details in the README.

Kris
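A quick way to confirm which scikit-learn version is active in the environment is sketched below. It assumes Python 3.8 or newer (for `importlib.metadata`), and the function names are hypothetical; the required version string is the one given in the reply above.

```python
from importlib import metadata
from typing import Optional

REQUIRED = "0.21.3"  # version named in the reply above


def installed_sklearn_version() -> Optional[str]:
    """Return the installed scikit-learn version, or None if it is absent."""
    try:
        return metadata.version("scikit-learn")
    except metadata.PackageNotFoundError:
        return None


def version_status(installed: Optional[str], required: str = REQUIRED) -> str:
    """Classify the installed version as 'ok', 'mismatch', or 'missing'."""
    if installed is None:
        return "missing"
    return "ok" if installed == required else "mismatch"


if __name__ == "__main__":
    found = installed_sklearn_version()
    print(f"scikit-learn installed: {found}, status: {version_status(found)}")
```

If the status is not `ok`, pinning the version with pip would typically look like `pip install scikit-learn==0.21.3`, though the README remains the place to check for the exact installation steps.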

BikramDroid commented 3 years ago

It worked very well, thanks for the help :)