fbreitwieser / krakenuniq

🐙 KrakenUniq: Metagenomics classifier with unique k-mer counting for more specific results
GNU General Public License v3.0

No data in report files #115

Open biolougy opened 2 years ago

biolougy commented 2 years ago

I have run the following:

krakenuniq --report-file results_reports_uniq/${OUTPUT}.tsv --db kraken_uniq --preload > results_uniq/${OUTPUT}_class.tsv ${INPUT1} ${INPUT2}

The classified outputs with one line per read are fine, but the reports are only showing the header and then nothing else. I am using version 1.0.0 to process fastq paired reads as an array. Am I missing something?

boulund commented 2 years ago

This could also be due to the bug that was fixed by my PR last week: https://github.com/fbreitwieser/krakenuniq/pull/117. @alekseyzimin, are there any plans for a new release anytime soon?

boulund commented 2 years ago

I'm seeing the same behavior described by @biolougy, also in version 1.0.1a, when running in Singularity. Maybe it is related? What environment are you running in, @biolougy, and how did you install krakenuniq?

biolougy commented 2 years ago

I was using krakenuniq in a Unix environment on a high-performance cluster. I ran it several times; sometimes I got reports for all of my samples, sometimes I did not. I don't really understand why!

boulund commented 2 years ago

Just to make sure I understand your situation, what does "only showing the header" mean in your case?

In my case, when running in Singularity, I get only the first two commented lines in the report and nothing else (not even the "header" with the column names). Example:

# KrakenUniq v1.0.1 DATE:2022-10-13T10:52:35Z DB:/ceph/db/krakenuniq/minikraken_20171013_4GB DB_SIZE:3758097436...
# CL:/usr/local/bin/krakenuniq --db /ceph/db/krakenuniq/minikraken_20171013_4GB --threads 4 --output output_dir/....

I shortened the lines for brevity in this GitHub comment. There is nothing else in the report file when I run krakenuniq in Singularity (using the latest biocontainer image), just those two lines.

hazmup commented 1 year ago

I am probably getting the same behavior. I am running KrakenUniq v1.0.3 on a Unix HPC. The report file contains only the two lines mentioned above. Also, the runs never finish even though the classification seems to be complete. The STDOUT and STDERR streams also seem not to work properly, but I do not know enough to be certain.

alekseyzimin commented 1 year ago

Hi, how big is your database? What version of KrakenUniq are you using, and what is your command line? Are you using any preload switches?


hazmup commented 1 year ago

Hi, sorry, I edited the version in later. I'm running v1.0.3 in a bash script, using the Standard database (377GB), with no preloading:

krakenuniq --db /path/to/database --threads 40 --paired $R1 $R2 --report-file $report_file > $classification_file

hazmup commented 1 year ago

Update: Preloading the database solved the issue for most files in my dataset, but it still occurs for some.

alekseyzimin commented 1 year ago

Yes, your issue may have to do with an out-of-memory error for some data sets. You can use --preload-size and set it to about half of your physical RAM.
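For example, adapting the command above to a node with 256GB of physical RAM (a hypothetical value; substitute roughly half of whatever your node actually has), that could look like:

krakenuniq --db /path/to/database --preload-size 128G --threads 40 --paired $R1 $R2 --report-file $report_file > $classification_file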


hazmup commented 1 year ago

Thank you for your help. Unfortunately it didn't fix the problem. I'm rerunning with an even smaller --preload-size, to see if it makes a difference.

hazmup commented 1 year ago

I ran it with 20% of the available memory (100GB out of 490GB) and it still fails for a couple of my samples. The memory is not even reported as fully used. They are indeed my largest samples, so it must have to do with that. Is it possible there is an issue with the temporary files being too large?

hazmup commented 1 year ago

It seems to hang and the last line of the log reads "Processed 65582031 sequences (database chunk 1 of 4)" every time.

salzberg commented 1 year ago

Don't use --preload, but instead use --preload-size (this applies to @biolougy's question above). We probably should just deprecate the --preload option because the newer preload-size is just better. --preload still requires that you have enough RAM for the whole database. I don't know how your systems are configured, but many Unix systems are set up with substantial amounts of swap, so they can never access all of the RAM. It's quite common for 50% of the RAM to be reserved, and this has to be set up at the time the OS is installed - you can't override it. Another problem is that other users might have some of the RAM. I suggest making the preload-size be quite modest, maybe 20GB, even if you think you have 100GB or more available. KrakenUniq still runs quite fast, and it shouldn't hang.
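For instance, @biolougy's original command could be adjusted along these lines, keeping everything else the same and only swapping the preload flag (20G here is just the modest value suggested above):

krakenuniq --report-file results_reports_uniq/${OUTPUT}.tsv --db kraken_uniq --preload-size 20G ${INPUT1} ${INPUT2} > results_uniq/${OUTPUT}_class.tsv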

hazmup commented 1 year ago

I have 490GB available. When using --preload-size 100G it seems to work, but the run times out after 48h, which I think is not normal. For the smaller datasets with no preloading it completes in around 5 minutes.

JochenSchaefergmxde commented 1 year ago

> Don't use --preload, but instead use --preload-size (this applies to @biolougy's question above). We probably should just deprecate the --preload option because the newer preload-size is just better. --preload still requires that you have enough RAM for the whole database. I don't know how your systems are configured, but many Unix systems are set up with substantial amounts of swap, so they can never access all of the RAM. It's quite common for 50% of the RAM to be reserved, and this has to be set up at the time the OS is installed - you can't override it. Another problem is that other users might have some of the RAM. I suggest making the preload-size be quite modest, maybe 20GB, even if you think you have 100GB or more available. KrakenUniq still runs quite fast, and it shouldn't hang.

If I use --preload-size 80GB (I have 128GB), the temporary files are written to my slow hard disk (I also write the report and output to this HD) and not to my M.2 SSD or RAM. How can I change that to speed things up?

krakenuniq --preload-size 80GB -db /mnt/m2/kuniqdb/kuniq_standard_plus_eupath_minus_kdb --threads 32 --exact --output /mnt/sdc1/dp/2023_04_25_09_39_33_dp_kuniq_stant2t2_O.txt --report-file /mnt/sdc1/dp/2023_04_25_09_39_33_dp_kuniq_stant2t_R.txt /mnt/m2/fastq/dp_t2t_u.fastq.gz

In /mnt/sdc1 (transfer only about 50Mb/s, but 18TB) you can find:

-rw-rw-r-- 1 internet internet 272G Apr 26 11:28 tmp9mO4Mi.prev
-rw-rw-r-- 1 internet internet 8,6G Apr 26 11:56 tmp9mO4Mi.9
-rw-rw-r-- 1 internet internet 8,6G Apr 26 11:56 tmp9mO4Mi.5
-rw-rw-r-- 1 internet internet 8,6G Apr 26 11:56 tmp9mO4Mi.30
-rw-rw-r-- 1 internet internet 8,6G Apr 26 11:56 tmp9mO4Mi.0
-rw-rw-r-- 1 internet internet 8,6G Apr 26 11:56 tmp9mO4Mi.16
-rw-rw-r-- 1 internet internet 8,6G Apr 26 11:56 tmp9mO4Mi.31
-rw-rw-r-- 1 internet internet 8,7G Apr 26 11:56 tmp9mO4Mi.20
-rw-rw-r-- 1 internet internet 8,6G Apr 26 11:56 tmp9mO4Mi.8
-rw-rw-r-- 1 internet internet 8,6G Apr 26 11:56 tmp9mO4Mi.28
-rw-rw-r-- 1 internet internet 8,6G Apr 26 11:56 tmp9mO4Mi.15
-rw-rw-r-- 1 internet internet 8,6G Apr 26 11:56 tmp9mO4Mi.11
-rw-rw-r-- 1 internet internet 8,6G Apr 26 11:56 tmp9mO4Mi.29
-rw-rw-r-- 1 internet internet 8,6G Apr 26 11:56 tmp9mO4Mi.21
-rw-rw-r-- 1 internet internet 8,6G Apr 26 11:56 tmp9mO4Mi.13
-rw-rw-r-- 1 internet internet 8,6G Apr 26 11:56 tmp9mO4Mi.24
-rw-rw-r-- 1 internet internet 8,6G Apr 26 11:56 tmp9mO4Mi.23
-rw-rw-r-- 1 internet internet 8,6G Apr 26 11:56 tmp9mO4Mi.2
-rw-rw-r-- 1 internet internet 8,6G Apr 26 11:56 tmp9mO4Mi.10
-rw-rw-r-- 1 internet internet 8,6G Apr 26 11:56 tmp9mO4Mi.12
-rw-rw-r-- 1 internet internet 8,6G Apr 26 11:56 tmp9mO4Mi.18
-rw-rw-r-- 1 internet internet 8,6G Apr 26 11:56 tmp9mO4Mi.26
-rw-rw-r-- 1 internet internet 8,6G Apr 26 11:56 tmp9mO4Mi.17
-rw-rw-r-- 1 internet internet 8,6G Apr 26 11:56 tmp9mO4Mi.3
-rw-rw-r-- 1 internet internet 8,6G Apr 26 11:56 tmp9mO4Mi.7
-rw-rw-r-- 1 internet internet 8,6G Apr 26 11:56 tmp9mO4Mi.19
-rw-rw-r-- 1 internet internet 8,6G Apr 26 11:56 tmp9mO4Mi.6
-rw-rw-r-- 1 internet internet 8,6G Apr 26 11:56 tmp9mO4Mi.22
-rw-rw-r-- 1 internet internet 8,6G Apr 26 11:56 tmp9mO4Mi.14
-rw-rw-r-- 1 internet internet 8,6G Apr 26 11:56 tmp9mO4Mi.1
-rw-rw-r-- 1 internet internet 8,6G Apr 26 11:56 tmp9mO4Mi.4
-rw-rw-r-- 1 internet internet 8,6G Apr 26 11:56 tmp9mO4Mi.25
-rw-rw-r-- 1 internet internet 8,6G Apr 26 11:56 tmp9mO4Mi.27
-rw-rw-r-- 1 internet internet 67G Apr 26 12:29 tmp9mO4Mi

salzberg commented 1 year ago

I don't recommend such a large value for preload-size (80GB). Then it has to write a large file to your hard disk, which you say is slow. Use 20GB instead - it will have to cycle through the data more times, but the file it's writing will be far smaller, and I suspect this will end up being faster (depending on properties of your hard disk and OS). The temp files have to go on disk no matter what - it has to store results somewhere each time it cycles through the input reads.
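As a rough back-of-the-envelope illustration (assuming the chunk count is simply the database size divided by the preload size, rounded up): the 377GB standard database mentioned earlier gives ceil(377/100) = 4 chunks with --preload-size 100G, which matches the "database chunk 1 of 4" log line above, while --preload-size 20G would give about 19 chunks, so more cycles over the reads but much less loaded and buffered per cycle.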

JochenSchaefergmxde commented 1 year ago

> I don't recommend such a large value for preload-size (80GB). Then it has to write a large file to your hard disk, which you say is slow. Use 20GB instead - it will have to cycle through the data more times, but the file it's writing will be far smaller, and I suspect this will end up being faster (depending on properties of your hard disk and OS). The temp files have to go on disk no matter what - it has to store results somewhere each time it cycles through the input reads.

Thanks for your recommendation. With 80GB I also had the problem that in cycle 4 only 1 thread out of 32 is used. I will try it with 20GB. The temporary files do have to be stored on a hard disk, yes, but I also have an M.2 SSD, and there is no way to define where the temp files should be stored; the difference between 50Mb/s and 5000Mb/s is big. It looks like the temp files get stored in the same directory as the report and output files. It would be easier if one could define the temp file location.

JochenSchaefergmxde commented 1 year ago

> I have 490GB available. When using --preload-size 100G it seems to work, but the run times out after 48h, which I think is not normal. For the smaller datasets with no preloading it completes in around 5 minutes.

I had the same problem: if you use --preload-size 80GB, the last cycle runs on only 1 thread out of 32, and that is much, much slower! If you use only 20GB it is faster, because the last cycle has less data to process on that 1 thread. So if you pick a really small amount to cycle through with 32 threads, it should be faster because the last cycle is smaller.

JochenSchaefergmxde commented 1 year ago

One solution for the temporary files is to set the TMP, TEMP, and TMPDIR variables (e.g. TMP=/mnt/...) in /etc/environment and activate them with

set -a; source /etc/environment; set +a

I would recommend pointing the TMP variable at very fast storage such as an M.2 SSD!
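As a concrete sketch, assuming the M.2 SSD is mounted at /mnt/m2 (adjust the path and directory to your own setup), the entries in /etc/environment could look like:

TMP=/mnt/m2/tmp
TEMP=/mnt/m2/tmp
TMPDIR=/mnt/m2/tmp

and then, before starting krakenuniq in the same shell:

mkdir -p /mnt/m2/tmp
set -a; source /etc/environment; set +a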

Also, if you get out-of-memory errors in syslog at the very end, when the output file is written, try reinstalling krakenuniq. In my case I had a classify executable from Kraken 1 in my PATH.
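One quick way to check for that kind of conflict (assuming, as in my case, the binary in question is called classify) is to list every copy on the PATH and confirm the first hit belongs to your KrakenUniq installation:

which -a classify
which -a krakenuniq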