cdiener / architeuthis

Tools to analyze and summarize data for Kraken2.
https://cdiener.github.io/architeuthis
Apache License 2.0

"filter" command is killed but writes output files for Bracken #4

Open zoey-rw opened 6 days ago

zoey-rw commented 6 days ago

I noticed that some of the "filter" runs for very large (20+ GB) files get killed but still produce output files that (seemingly?) work fine when passed to Bracken. I don't think it's a memory limitation of the environment, because my Bash loop continues and the filter command executes successfully on another large file. Other than the stdout message "Killed", the only clue was a warning when the 2+ GB filtered output file is read into R:

> incomplete = fread("/projectnb/frpmars/soil_microbe_db/NEON_metagenome_classification/02_bracken_output/TEAK_005-O-20210728-COMP_soil_microbe_db_filtered.output")
Avoidable 4.523 seconds. This file is very unusual: it ends abruptly without a final newline, and also its size is a multiple of 4096 bytes. Please properly end the last row with a newline using for example 'echo >> file' to avoid this  time to copy.

The filter command:

architeuthis mapping filter $KRAKEN_OUTPUT --db $DBDIR --data-dir $DB_taxonomy_dir --out $ARCHITEUTHIS_FILTERED 

The command line output:

2024/09/25 02:15:18 Processing 65276442 reads - Done.
2024/09/25 02:15:21 Pass 2: Score individuals reads...
2024/09/25 02:15:21 Reading k-mer assignments from /projectnb/frpmars/soil_microbe_db/NEON_metagenome_classification/01_kraken_output/TEAK_005-O-20210728-COMP_soil_microbe_db_kraken.output and writing to /projectnb/frpmars/soil_microbe_db/NEON_metagenome_classification/02_bracken_output/TEAK_005-O-20210728-COMP_soil_microbe_db_filtered.output.
2024/09/25 02:15:30 Processed 4000000 reads...
2024/09/25 02:15:38 Processed 7000000 reads...
2024/09/25 02:15:45 Processed 10000000 reads...
2024/09/25 02:16:02 Processed 17000000 reads...
2024/09/25 02:16:12 Processed 21000000 reads...
2024/09/25 02:16:41 Processed 33000000 reads...
2024/09/25 02:17:17 Processed 48000000 reads...
2024/09/25 02:17:44 Processed 59000000 reads...
2024/09/25 02:18:00 Processed 66000000 reads...
2024/09/25 02:18:17 Processed 73000000 reads...
2024/09/25 02:18:20 Processed 74000000 reads...
2024/09/25 02:18:24 Processed 76000000 reads...
2024/09/25 02:18:56 Processed 89000000 reads...
2024/09/25 02:19:46 Processed 110000000 reads...
2024/09/25 02:19:58 Processed 115000000 reads...
Killed

To find the incomplete files, I borrowed this bash function to print any files that do not end with a newline:

function file_ends_with_newline() {
    # succeeds if the last byte of the file is a newline
    [[ $(tail -c1 "$1" | wc -l) -gt 0 ]]
}

for samp_file in /projectnb/frpmars/soil_microbe_db/NEON_metagenome_classification/02_bracken_output/*_soil_microbe_db_filtered.output; do
    if ! file_ends_with_newline "$samp_file"; then
        echo "$samp_file likely incomplete"
    fi
done

In my case, this returned 9 files out of about 1400, so it is likely an edge case. A couple of the output files from the architeuthis "score" command were also returned. I'm not sure what the ideal behavior would be here (maybe adding a "complete" flag after writing? or maybe Bracken should catch this when reading in files?). It seems like I can just re-run the problem samples, but I wanted to flag it!

P.S. thanks for all your efforts to write/maintain scientific software. I am your number 1 fan.

cdiener commented 5 days ago

What is the return code of the program in those cases? If it is non-zero, the easiest fix would be to check it in your script. The code should also tell you what killed the process.
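In a Bash loop, that check might look like the sketch below. The `run_filter` function here is a hypothetical stand-in that deliberately dies from SIGKILL just to show the pattern; in practice you would call the real `architeuthis mapping filter ...` command instead.

```shell
#!/bin/sh
# Stand-in for the real filter call, e.g.:
#   architeuthis mapping filter "$f" --db "$DBDIR" --out "$out"
# Here it kills its own subshell with SIGKILL to simulate an OOM kill.
run_filter() {
    sh -c 'kill -KILL $$'
}

run_filter
status=$?
if [ "$status" -ne 0 ]; then
    # 137 = 128 + 9 (SIGKILL) is the usual signature of the kernel OOM killer
    echo "sample failed with exit code $status"
fi
```

An exit code of 137 strongly suggests the scheduler or kernel OOM killer terminated the run, rather than a bug in the tool itself.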

architeuthis uses very little RAM (<100 MB for filter), but some schedulers are not good at distinguishing cached files from memory usage, and that can create issues with buffered IO. The logs would also tell you whether the file is complete; they always end in a line like

2024/09/26 15:30:56 Processed 9614861 reads - Done. 9129184/9614861 reads passed the filter.
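So one way to spot truncated runs without opening the data file is to check whether the last line of each captured log contains that "Done" summary. A minimal sketch (the log file and its contents here are fabricated for demonstration):

```shell
#!/bin/sh
# Demo log file; in practice this would be the captured stderr of a filter run.
log=$(mktemp)
printf '%s\n' '2024/09/26 15:30:56 Processed 9614861 reads - Done. 9129184/9614861 reads passed the filter.' > "$log"

# A run is complete if its log ends with the "Done." summary line.
if tail -n 1 "$log" | grep -q 'Done\.'; then
    result=complete
else
    result=incomplete
fi
echo "$result"
rm -f "$log"
```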

P.S. thanks for all your efforts to write/maintain scientific software.

You're welcome! Glad the tool is useful.