DaehwanKimLab / centrifuge

Classifier for metagenomic sequences
GNU General Public License v3.0

Issues with --nodc, --justref and --noref parameters #248

Open darko-cucin opened 1 year ago

darko-cucin commented 1 year ago

I recently ran centrifuge-build (version 1.0.4, in a Docker container based on Ubuntu 18.04) on sequences downloaded from the RefSeq database. I tested all of its parameters, and for the three named in the title I am not sure whether they work properly or what use cases they are meant for. I then ran centrifuge-build on the example files provided in the example folder of the Centrifuge toolkit, and the result was the same.

Commands that I ran are as follows:

  1. centrifuge-build --conversion-table gi_to_tid.dmp --taxonomy-tree nodes.dmp --name-table names.dmp --nodc test.fa test_index_nodc

This is the stdout the command produced:

Settings:
  Output files: "test_index_nodc.*.cf"
  Line rate: 7 (line is 128 bytes)
  Lines per side: 1 (side is 128 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Local offset rate: 3 (one in 8)
  Local fTable chars: 6
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  test.fa
Reading reference sizes
  Time reading reference sizes: 00:00:00
Calculating joined length
Writing header
Reserving space for joined string
Joining reference sequences
  Time to join reference sequences: 00:00:00
bmax according to bmaxDivN setting: 268
Using parameters --bmax 201 and no difference cover
  Doing ahead-of-time memory usage test
qemu: uncaught target signal 8 (Arithmetic exception) - core dumped
Floating point exception

In this case, centrifuge-build produced 3 index files (1.cf and 2.cf are empty). When centrifuge was then run against this index, it produced no output report file because of the problem with the index files. Also, why does a floating point exception occur when the --nodc parameter is specified? Are there any use cases in which the resulting index files can be used for further analysis, or is this expected behaviour?
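For anyone trying to reproduce this, a minimal sketch of a check for empty index files after a build is given below. It assumes the index files follow the `<prefix>.<N>.cf` pattern shown in the "Output files" line of the stdout above; the function name `inspect_index` is hypothetical and not part of Centrifuge.

```python
from pathlib import Path


def inspect_index(prefix):
    """Return (all_files, empty_files) for index files named <prefix>.<N>.cf.

    The naming pattern is an assumption taken from the "Output files"
    line printed by centrifuge-build.
    """
    p = Path(prefix)
    # Collect every file next to the prefix that matches <prefix>.*.cf
    files = sorted(p.parent.glob(p.name + ".*.cf"))
    # A zero-byte file cannot hold a usable index
    empty = [f for f in files if f.stat().st_size == 0]
    return files, empty
```

Calling `inspect_index("test_index_nodc")` in the build directory should list the index files and flag any that are zero bytes long, which makes it easy to tell a failed build from a successful one before running centrifuge.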

  2. centrifuge-build --conversion-table gi_to_tid.dmp --taxonomy-tree nodes.dmp --name-table names.dmp --justref test.fa test_index_justref

This is the stdout the command produced:

Settings:
  Output files: "test_index_justref.*.cf"
  Line rate: 7 (line is 128 bytes)
  Lines per side: 1 (side is 128 bytes)
  Offset rate: 4 (one in 16)
  FTable chars: 10
  Strings: unpacked
  Local offset rate: 3 (one in 8)
  Local fTable chars: 6
  Max bucket size: default
  Max bucket size, sqrt multiplier: default
  Max bucket size, len divisor: 4
  Difference-cover sample period: 1024
  Endianness: little
  Actual local endianness: little
  Sanity checking: disabled
  Assertions: disabled
  Random seed: 0
  Sizeofs: void*:8, int:4, long:8, size_t:8
Input files DNA, FASTA:
  test.fa
Reading reference sizes
  Time reading reference sizes: 00:00:00
Total time for call to driver() for forward index: 00:00:00

In this case, no output index files were produced. Are there any use cases in which output files are produced, or is this expected behaviour?

  3. centrifuge-build --conversion-table gi_to_tid.dmp --taxonomy-tree nodes.dmp --name-table names.dmp --noref test.fa test_index_noref

In this case, the output index files were the same as the files produced without the --noref parameter. Are there any use cases in which the output index files differ from those produced with default parameters, or is this expected behaviour?
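One way to confirm that two builds really produced the same index is to compare checksums file by file. The sketch below is only an illustration: it assumes index files are named `<prefix>.<N>.cf`, and both `compare_indexes` and the prefix `test_index_default` (an index built without --noref) are hypothetical names.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_indexes(prefix_a, prefix_b):
    """Map each suffix (e.g. '.1.cf') to True if both files exist and match."""

    def by_suffix(prefix):
        p = Path(prefix)
        # Key each file by what follows the prefix, e.g. '.1.cf'
        return {f.name[len(p.name):]: f
                for f in p.parent.glob(p.name + ".*.cf")}

    a, b = by_suffix(prefix_a), by_suffix(prefix_b)
    return {
        suffix: (suffix in a and suffix in b
                 and sha256_of(a[suffix]) == sha256_of(b[suffix]))
        for suffix in sorted(set(a) | set(b))
    }
```

For example, `compare_indexes("test_index_default", "test_index_noref")` returns a dictionary such as `{'.1.cf': True, ...}`; any `False` entry marks a file that is missing from one build or differs in content.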

Thank you in advance.