caozhichongchong / arg_ranker

MIT License
23 stars 11 forks source link

support gzipped fastq files #16

Closed yqy6611 closed 7 months ago

yqy6611 commented 10 months ago

Hi all,

May I know if it's accessible to scan gzipped fastq files via arg_ranker as my metagenomic reads are all quite large. Many thanks!

caozhichongchong commented 10 months ago

Hi,

Thank you for reaching out! The arg_ranker should support gzipped fastq files now. To update it, please run the command _pip install argranker. If you encounter any issues, please don't hesitate to contact me.

Best regards, Anni

yqy6611 commented 10 months ago

Hi Anni,

Thanks for your quick reply. Another suggestion is whether arg_ranker will support pair-end reads as kraken2 and microbesensus support pair-end reads. For diamond results, can merge results from two paired read files like how arg-oap works? Many thanks!

caozhichongchong commented 10 months ago

Thank you for your suggestion! I'll try to make it work soon and will keep you posted :)

Best regards, Anni

yqy6611 commented 10 months ago

Thanks Anni!

caozhichongchong commented 10 months ago

I now updated arg_ranker 3.5 (_pip install argranker==3.5 --upgrade) which should support pair-end reads :) Could you please help me test it? Specifically, would you be able to run both versions 3.4 and 3.5 on the same sample with paired-end reads and share the results with me? Thank you!

Best regards, Anni

LeabaeL commented 8 months ago

Hi Anni,

I recently used the handy ARGs annotation software you developed and noticed that you added the ability to glue the results of double-ended files and support compression files in the new version. Here are the results I got using my own paired-ended test data in different versions.

It is worth noting that the number in the column "file_num_in_search_output" represents the number of files generated. According to the previous question results, I think the 7 output files are complete, and the letters represent which file of the paired-ended file the test file belongs to

You can find that there is a big problem with the annotation results when they are not decompressed. In addition, the annotation abundance results were similar for the three versions, but the total abundance calculation was very different, which was confusing because I checked the number of rows in the "arg_ranker/data" folder and there was no change in the three versions of the database.

I look forward to receiving your answers and thank you again for providing such a convenient annotation tool.

sample kraken arg_ranker diamond blast file_num_in_search_output missing file compressed Rank_I_per Rank_II_per Rank_III_per Rank_IV_per Unassessed_per Total_abu Rank_code
test_2 v 2.1.3_pluspfp 3.4 v 2.1.8 v 2.15.0 6R .kraken no❌ 1.2E-01 9.8E-02 1.4E-01 6.4E-01 0.0E+00 2.7E+04 5.5-6.4-0.7-0.8-0.0
test_1 v 2.1.3_pluspfp 3.4 v 2.1.8 v2.15.0 6F .kraken no❌ 1.2E-01 1.0E-01 1.3E-01 6.5E-01 0.0E+00 2.7E+04 5.5-6.5-0.7-0.8-0.0
test_2 v 2.1.3_pluspfp 3.4 v 2.1.8 v 2.15.0 7R yes✅ 0 0 0 0 0 0 0.0-0.0-0.0-0.0-0.0
test_1 v 2.1.3_pluspfp 3.4 v 2.1.8 v 2.15.0 7F yes✅ 0 0 0 0 0 0 0.0-0.0-0.0-0.0-0.0
test v 2.1.3_pluspfp 3.5 v 2.1.8 v 2.15.0 6F+3R F.AGS, R.AGS, R.blast.txt.filter, R.kraken, R.kraken.kreport no❌ 1.1E-01 9.9E-02 1.7E-01 6.2E-01 0.0E+00 2.3E-03 5.2-6.4-0.9-0.8-0.0
test v 2.1.3_pluspfp 3.5 v 2.1.8 v 2.15.0 6F+6R F.AGS, R.blast.txt.filter yes✅ 0 0 0 0 0 0 0.0-0.0-0.0-0.0-0.0
test_2 v 2.1.3_pluspfp 3.3 v 2.1.8 v 2.15.0 7R no❌ 1.1E-01 9.9E-02 1.7E-01 6.2E-01 0.0E+00 7.3E+00 5.2-6.4-0.9-0.8-0.0
test_1 v 2.1.3_pluspfp 3.3 v 2.1.8 v 2.15.0 7F no❌ 1.1E-01 1.0E-01 1.7E-01 6.2E-01 0.0E+00 7.2E+00 5.2-6.6-0.8-0.8-0.0

Best regards, LeabaeL

caozhichongchong commented 8 months ago

Hi LeabaeL,

Thank you so much for your super valuable feedback!

I think arg_ranker (especially diamond and python) cannot handle compressed files. I have changed arg_ranker to directly decompress these files. The absence of a .kraken output in v3.4 likely led to inaccuracies in the Total_abu. The results from v3.5 for paired-end samples without compression seem to be accurate. To clarify the output, I've released v3.6, which will generate an AGS, a blast.txt.filter, and a kraken report for paired-end samples.

Please let me know if you have any questions! Super appreciate your help and support!

Best regards, Anni

LeabaeL commented 8 months ago

Hello Anni,

thank you very much for your prompt updates. I've tested the 3.6 version using both test data and the example data provided with the software. I've noticed that the independent calculation results for paired-end samples in the 3.6 version remain consistent with the 3.4 version. However, the difference in magnitudes might be due to the fact that the results calculated in the table represent the count of reads matched to ARGs, rather than normalized based on cell numbers. This might be causing the issue of extremely high total abundance.

The results are illustrated in the table below.

sample kraken arg_ranker diamond blast file_num_in_search_output compressed Rank_I_per Rank_II_per Rank_III_per Rank_IV_per Unassessed_per Total_abu Rank_code
test_1 v 2.1.3_pluspfp 3.3 v 2.1.8 v 2.15.0 7F no❌ 1.1E-01 1.0E-01 1.7E-01 6.2E-01 0.0E+00 7.2E+00 5.2-6.6-0.8-0.8-0.0
test_2 v 2.1.3_pluspfp 3.3 v 2.1.8 v 2.15.0 7R no❌ 1.1E-01 9.9E-02 1.7E-01 6.2E-01 0.0E+00 7.3E+00 5.2-6.4-0.9-0.8-0.0
test_1 v 2.1.3_pluspfp 3.4 v 2.1.8 v2.15.0 6F no❌ 1.2E-01 1.0E-01 1.3E-01 6.5E-01 0.0E+00 2.7E+04 5.5-6.5-0.7-0.8-0.0
test_2 v 2.1.3_pluspfp 3.4 v 2.1.8 v 2.15.0 6R no❌ 1.2E-01 9.8E-02 1.4E-01 6.4E-01 0.0E+00 2.7E+04 5.5-6.4-0.7-0.8-0.0
test v 2.1.3_pluspfp 3.5 v 2.1.8 v 2.15.0 6F+3R no❌ 1.1E-01 9.9E-02 1.7E-01 6.2E-01 0.0E+00 2.3E-03 5.2-6.4-0.9-0.8-0.0
test_1 v 2.1.3_pluspfp 3.6 v 2.1.8 v 2.15.0 4F (without AGS) no❌ 1.2E-01 1.0E-01 1.3E-01 6.5E-01 0.0E+00 2.7E+04 5.5-6.5-0.7-0.8-0.0
test_2 v 2.1.3_pluspfp 3.6 v 2.1.8 v 2.15.0 4R (without AGS) no❌ 1.2E-01 9.8E-02 1.4E-01 6.4E-01 0.0E+00 2.7E+04 5.5-6.4-0.7-0.8-0.0
test v 2.1.3_pluspfp 3.6 v 2.1.8 v 2.15.0 4 no❌ 1.1E-01 1.0E-01 1.7E-01 6.2E-01 0.0E+00 1.1E+01 5.2-6.5-0.8-0.8-0.0
WEE300 v 2.1.3_pluspfp 3.3 v 2.1.8 v 2.15.0 7 no❌ 6.6E-02 2.1E-02 2.2E-01 6.9E-01 0.0E+00 4.7E+00 3.1-1.3-1.1-0.9-0.0
WEE300 v 2.1.3_pluspfp 3.4 v 2.1.8 v 2.15.0 5 no❌ 4.9E-02 1.5E-02 1.5E-01 7.8E-01 0.0E+00 1.1E+03 2.3-1.0-0.8-1.0-0.0
WEE300 v 2.1.3_pluspfp 3.5 v 2.1.8 v 2.15.0 7 no❌ 6.6E-02 2.1E-02 2.2E-01 6.9E-01 0.0E+00 4.7E+00 3.1-1.3-1.1-0.9-0.0
WEE300 v 2.1.3_pluspfp 3.6 v 2.1.8 v 2.15.0 7 no❌ 6.6E-02 2.1E-02 2.2E-01 6.9E-01 0.0E+00 4.7E+00 3.1-1.3-1.1-0.9-0.0
EsCo v 2.1.3_pluspfp 3.3 v 2.1.8 v 2.15.0 2 no❌ 7.1E-02 0.0E+00 2.1E-01 7.1E-01 0.0E+00 1.4E+01 3.3-0.0-1.1-0.9-0.0
EsCo v 2.1.3_pluspfp 3.4 v 2.1.8 v 2.15.0 2 no❌ 7.1E-02 0.0E+00 2.1E-01 7.1E-01 0.0E+00 1.4E+01 3.3-0.0-1.1-0.9-0.0
EsCo v 2.1.3_pluspfp 3.5 v 2.1.8 v 2.15.0 2 no❌ 7.1E-02 0.0E+00 2.1E-01 7.1E-01 0.0E+00 1.4E+01 3.3-0.0-1.1-0.9-0.0
EsCo v 2.1.3_pluspfp 3.6 v 2.1.8 v 2.15.0 2 no❌ 7.1E-02 0.0E+00 2.1E-01 7.1E-01 0.0E+00 1.4E+01 3.3-0.0-1.1-0.9-0.0

Thank you again for your help, Anni. I appreciate your support.

Best regards, LeabaeL

caozhichongchong commented 8 months ago

Hi LeabaeL,

Thank you for your fast feedback! I think paired-end samples were incorrectly paired even though metadata indicates they should be treated as independent. It should now be fixed in v3.7 :) It's odd that v3.6 only generated 4 files for test samples test v 2.1.3_pluspfp 3.6 v 2.1.8 v 2.15.0 4

Could you please verify if the following files were generated: test.kraken, test.kraken.kreport, test.AGS.txt, and test.blast.txt.filter? Also, are these files not empty? Super appreciate it!

Best regards, Anni

LeabaeL commented 8 months ago

Hi Anni,

Thanks for the rapid response!

This is the previous v3.6 generated file, all file counts are by prefix only.

image

And the newest version results are as followed

1704124494540

v 3.7

1704124725405

Best regards, LeabaeL

caozhichongchong commented 8 months ago

Hi LeabaeL,

Thank you for the information!

The output appears correct. However, it's weird that the Total_abu of test is the sum of test_1 and test_2 (not an average).

I can examine it further if you would like :) It would be super helpful to provide the few lines of AGS.txt (cat *.AGS.txt) and kraken.kreport (grep 'Bacteria' *.kreport) for test, test_1, and test_2.

Happy New Year! Anni

LeabaeL commented 8 months ago

Hi Anni,

Happy New Year! Thank you again for your reply!

I checked the results in v 3.7 search_output, however, there are no test results (only test_1 and test_2).

v 3.7 example:

**grep 'Bacteria' *.kreport**
 39.62  490333  2169    D   2       Bacteria
  0.00  18  0   D1  2323          Bacteria incertae sedis
  0.00  18  0   D2  1783234         Bacteria candidate phyla

**cat *.AGS.txt:**
Parameters
metagenome: /backup/test/example37/WEE300_all-trimmed-decont_1.fastq
reads_sampled:  1184884
trimmed_length: 100
min_quality:    -5
mean_quality:   -5
filter_dups:    False
max_unknown:    100

Results
average_genome_size:    5504511.121548271
total_bases:    124275596
genome_equivalents: 22.57704512822287

**ll**:
-rw-rw-r--. 1 guest guest  2232536 1月   1 22:52 EsCo_genome.fasta.blast.txt
-rw-rw-r--. 1 guest guest     1573 1月   1 22:52 EsCo_genome.fasta.blast.txt.filter
-rw-rw-r--. 1 guest guest      308 1月   1 22:50 WEE300_all-trimmed-decont_1.fastq.AGS.txt
-rw-rw-r--. 1 guest guest 11395149 1月   1 22:51 WEE300_all-trimmed-decont_1.fastq.blast.txt
-rw-rw-r--. 1 guest guest   124262 1月   1 22:51 WEE300_all-trimmed-decont_1.fastq.blast.txt.filter
-rw-rw-r--. 1 guest guest   130922 1月   1 22:44 WEE300_all-trimmed-decont_1.fastq.diamond.txt
-rw-rw-r--. 1 guest guest   185100 1月   1 22:44 WEE300_all-trimmed-decont_1.fastq.diamond.txt.aa
-rw-rw-r--. 1 guest guest 85625469 1月   1 22:47 WEE300_all-trimmed-decont_1.fastq.kraken
-rw-rw-r--. 1 guest guest   605071 1月   1 22:47 WEE300_all-trimmed-decont_1.fastq.kraken.kreport
-rw-rw-r--. 1 guest guest        0 1月   1 22:58 WEE300_all-trimmed-decont_1.fastq.zip.blast.txt
-rw-rw-r--. 1 guest guest        0 1月   1 22:58 WEE300_all-trimmed-decont_1.fastq.zip.blast.txt.filter
-rw-rw-r--. 1 guest guest        0 1月   1 22:52 WEE300_all-trimmed-decont_1.fastq.zip.diamond.txt.aa

v 3.7 test:

**grep 'Bacteria' *.kreport**
test_1.fq.kraken.kreport: 26.58 7160704 258364  D   2       Bacteria
test_1.fq.kraken.kreport:  0.01 1485    0   D1  2323          Bacteria incertae sedis
test_1.fq.kraken.kreport:  0.01 1485    0   D2  1783234         Bacteria candidate phyla
test_1.fq.kraken.kreport:  0.00 131 0   D1  49928         unclassified Bacteria
test_2.fq.kraken.kreport: 26.51 7140274 258165  D   2       Bacteria
test_2.fq.kraken.kreport:  0.01 1381    0   D1  2323          Bacteria incertae sedis
test_2.fq.kraken.kreport:  0.01 1381    1   D2  1783234         Bacteria candidate phyla
test_2.fq.kraken.kreport:  0.00 145 0   D1  49928         unclassified Bacteria

**cat *.AGS.txt:**
Parameters
metagenome: /backup/test/37/test_1.fq
reads_sampled:  2000000
trimmed_length: 150
min_quality:    -5
mean_quality:   -5
filter_dups:    False
max_unknown:    100

Results
average_genome_size:    3583202.382139245
total_bases:    4040427600
genome_equivalents: 1127.6023983852629

Parameters
metagenome: /backup/test/37/test_2.fq
reads_sampled:  2000000
trimmed_length: 150
min_quality:    -5
mean_quality:   -5
filter_dups:    False
max_unknown:    100

Results
average_genome_size:    3636998.2031177795
total_bases:    4040427600
genome_equivalents: 1110.9237273024728

**ll**:
-rw-rw-r--. 1 guest guest        277 1月   1 23:17 test_1.fq.AGS.txt
-rw-rw-r--. 1 guest guest  265867221 1月   1 23:18 test_1.fq.blast.txt
-rw-rw-r--. 1 guest guest    2991596 1月   1 23:18 test_1.fq.blast.txt.filter
-rw-rw-r--. 1 guest guest    2855730 1月   1 22:58 test_1.fq.diamond.txt
-rw-rw-r--. 1 guest guest    5452028 1月   1 23:02 test_1.fq.diamond.txt.aa
-rw-rw-r--. 1 guest guest 1951244733 1月   1 23:07 test_1.fq.kraken
-rw-rw-r--. 1 guest guest    1037327 1月   1 23:07 test_1.fq.kraken.kreport
-rw-rw-r--. 1 guest guest        278 1月   1 23:43 test_2.fq.AGS.txt
-rw-rw-r--. 1 guest guest  259519913 1月   1 23:44 test_2.fq.blast.txt
-rw-rw-r--. 1 guest guest    2984383 1月   1 23:44 test_2.fq.blast.txt.filter
-rw-rw-r--. 1 guest guest    2845180 1月   1 23:23 test_2.fq.diamond.txt
-rw-rw-r--. 1 guest guest    5433091 1月   1 23:28 test_2.fq.diamond.txt.aa
-rw-rw-r--. 1 guest guest 1950904506 1月   1 23:33 test_2.fq.kraken
-rw-rw-r--. 1 guest guest    1035087 1月   1 23:33 test_2.fq.kraken.kreport

Just in case you need it, the results from v3.3 onwards are also shown below

v 3.3 example:

**grep 'Bacteria' *.kreport**
 39.62  490333  2169    D   2       Bacteria
  0.00  18  0   D1  2323          Bacteria incertae sedis
  0.00  18  0   D2  1783234         Bacteria candidate phyla

**cat *.AGS.txt:**
Parameters
metagenome: /backup/test/example33/WEE300_all-trimmed-decont_1.fastq
reads_sampled:  1184884
trimmed_length: 100
min_quality:    -5
mean_quality:   -5
filter_dups:    False
max_unknown:    100

Results
average_genome_size:    5504511.121548271
total_bases:    124275596
genome_equivalents: 22.57704512822287

**ll**:
-rw-rw-r--. 1 guest guest  2232536 12月 31 01:16 EsCo_genome.fasta.blast.txt
-rw-rw-r--. 1 guest guest     1573 12月 31 01:16 EsCo_genome.fasta.blast.txt.filter
-rw-rw-r--. 1 guest guest      308 12月 31 01:30 WEE300_all-trimmed-decont_1.fastq.AGS.txt
-rw-rw-r--. 1 guest guest 11395149 12月 31 01:30 WEE300_all-trimmed-decont_1.fastq.blast.txt
-rw-rw-r--. 1 guest guest   124262 12月 31 01:30 WEE300_all-trimmed-decont_1.fastq.blast.txt.filter
-rw-rw-r--. 1 guest guest   130922 12月 31 01:16 WEE300_all-trimmed-decont_1.fastq.diamond.txt
-rw-rw-r--. 1 guest guest   185100 12月 31 01:16 WEE300_all-trimmed-decont_1.fastq.diamond.txt.aa
-rw-rw-r--. 1 guest guest 85625469 12月 31 01:26 WEE300_all-trimmed-decont_1.fastq.kraken
-rw-rw-r--. 1 guest guest   605071 12月 31 01:27 WEE300_all-trimmed-decont_1.fastq.kraken.kreport

v 3.3 test:

**grep 'Bacteria' *.kreport**
test_1.fq.kraken.kreport: 26.58 7160704 258364  D   2       Bacteria
test_1.fq.kraken.kreport:  0.01 1485    0   D1  2323          Bacteria incertae sedis
test_1.fq.kraken.kreport:  0.01 1485    0   D2  1783234         Bacteria candidate phyla
test_1.fq.kraken.kreport:  0.00 131 0   D1  49928         unclassified Bacteria
test_2.fq.kraken.kreport: 26.51 7140274 258165  D   2       Bacteria
test_2.fq.kraken.kreport:  0.01 1381    0   D1  2323          Bacteria incertae sedis
test_2.fq.kraken.kreport:  0.01 1381    1   D2  1783234         Bacteria candidate phyla
test_2.fq.kraken.kreport:  0.00 145 0   D1  49928         unclassified Bacteria

**cat *.AGS.txt:**
Parameters
metagenome: /backup/test/33/test_1.fq
reads_sampled:  2000000
trimmed_length: 150
min_quality:    -5
mean_quality:   -5
filter_dups:    False
max_unknown:    100

Results
average_genome_size:    3583202.382139245
total_bases:    4040427600
genome_equivalents: 1127.6023983852629

Parameters
metagenome: /backup/test/33/test_2.fq
reads_sampled:  2000000
trimmed_length: 150
min_quality:    -5
mean_quality:   -5
filter_dups:    False
max_unknown:    100

Results
average_genome_size:    3636998.2031177795
total_bases:    4040427600
genome_equivalents: 1110.9237273024728

**ll**:
-rw-rw-r--. 1 guest guest        277 12月 26 03:44 test_1.fq.AGS.txt
-rw-rw-r--. 1 guest guest  265867221 12月 26 03:44 test_1.fq.blast.txt
-rw-rw-r--. 1 guest guest    2991596 12月 26 03:45 test_1.fq.blast.txt.filter
-rw-rw-r--. 1 guest guest    2855730 12月 26 03:26 test_1.fq.diamond.txt
-rw-rw-r--. 1 guest guest    5452028 12月 26 03:32 test_1.fq.diamond.txt.aa
-rw-rw-r--. 1 guest guest 1951244733 12月 26 03:34 test_1.fq.kraken
-rw-rw-r--. 1 guest guest    1037327 12月 26 03:35 test_1.fq.kraken.kreport
-rw-rw-r--. 1 guest guest        278 12月 26 04:07 test_2.fq.AGS.txt
-rw-rw-r--. 1 guest guest  259519913 12月 26 04:07 test_2.fq.blast.txt
-rw-rw-r--. 1 guest guest    2984383 12月 26 04:07 test_2.fq.blast.txt.filter
-rw-rw-r--. 1 guest guest    2845180 12月 26 03:48 test_2.fq.diamond.txt
-rw-rw-r--. 1 guest guest    5433091 12月 26 03:54 test_2.fq.diamond.txt.aa
-rw-rw-r--. 1 guest guest 1950904506 12月 26 03:57 test_2.fq.kraken
-rw-rw-r--. 1 guest guest    1035087 12月 26 03:57 test_2.fq.kraken.kreport

v 3.4 example:

**grep 'Bacteria' *.kreport**
WEE300_all-trimmed-decont_1.fastq.zip.kraken.kreport: 39.62 490333  2169    D   2       Bacteria
WEE300_all-trimmed-decont_1.fastq.zip.kraken.kreport:  0.00 18  0   D1  2323          Bacteria incertae sedis
WEE300_all-trimmed-decont_1.fastq.zip.kraken.kreport:  0.00 18  0   D2  1783234         Bacteria candidate phyla

**cat *.AGS.txt:**
Parameters
metagenome: /backup/test/example34/WEE300_all-trimmed-decont_1.fastq
reads_sampled:  1184884
trimmed_length: 100
min_quality:    -5
mean_quality:   -5
filter_dups:    False
max_unknown:    100

Results
average_genome_size:    5504511.121548271
total_bases:    124275596
genome_equivalents: 22.57704512822287

**ll**:
-rw-rw-r--. 1 guest guest  2232536 12月 31 01:40 EsCo_genome.fasta.blast.txt
-rw-rw-r--. 1 guest guest     1573 12月 31 01:40 EsCo_genome.fasta.blast.txt.filter
-rw-rw-r--. 1 guest guest      308 12月 31 01:45 WEE300_all-trimmed-decont_1.fastq.AGS.txt
-rw-rw-r--. 1 guest guest 11395149 12月 31 01:46 WEE300_all-trimmed-decont_1.fastq.blast.txt
-rw-rw-r--. 1 guest guest   124262 12月 31 01:46 WEE300_all-trimmed-decont_1.fastq.blast.txt.filter
-rw-rw-r--. 1 guest guest   130922 12月 31 01:40 WEE300_all-trimmed-decont_1.fastq.diamond.txt
-rw-rw-r--. 1 guest guest   185100 12月 31 01:41 WEE300_all-trimmed-decont_1.fastq.diamond.txt.aa
-rw-rw-r--. 1 guest guest        0 12月 31 01:42 WEE300_all-trimmed-decont_1.fastq.kraken.kreport
-rw-rw-r--. 1 guest guest        0 12月 31 01:47 WEE300_all-trimmed-decont_1.fastq.zip.blast.txt
-rw-rw-r--. 1 guest guest        0 12月 31 01:47 WEE300_all-trimmed-decont_1.fastq.zip.blast.txt.filter
-rw-rw-r--. 1 guest guest        0 12月 31 01:46 WEE300_all-trimmed-decont_1.fastq.zip.diamond.txt.aa
-rw-rw-r--. 1 guest guest 85625469 12月 31 01:47 WEE300_all-trimmed-decont_1.fastq.zip.kraken
-rw-rw-r--. 1 guest guest   605071 12月 31 01:47 WEE300_all-trimmed-decont_1.fastq.zip.kraken.kreport

v 3.4 test:

**grep 'Bacteria' *.kreport**

**cat *.AGS.txt:**
Parameters
metagenome: /backup/test/34/test_1.fq
reads_sampled:  2000000
trimmed_length: 150
min_quality:    -5
mean_quality:   -5
filter_dups:    False
max_unknown:    100

Results
average_genome_size:    3583202.382139245
total_bases:    4040427600
genome_equivalents: 1127.6023983852629

Parameters
metagenome: /backup/test/34/test_2.fq
reads_sampled:  2000000
trimmed_length: 150
min_quality:    -5
mean_quality:   -5
filter_dups:    False
max_unknown:    100

Results
average_genome_size:    3636998.2031177795
total_bases:    4040427600
genome_equivalents: 1110.9237273024728

**ll**:
-rw-rw-r--. 1 guest guest       277 12月 25 20:13 test_1.fq.AGS.txt
-rw-rw-r--. 1 guest guest 265867221 12月 25 20:14 test_1.fq.blast.txt
-rw-rw-r--. 1 guest guest   2991596 12月 25 20:14 test_1.fq.blast.txt.filter
-rw-rw-r--. 1 guest guest   2855730 12月 25 19:55 test_1.fq.diamond.txt
-rw-rw-r--. 1 guest guest   5452028 12月 25 19:59 test_1.fq.diamond.txt.aa
-rw-rw-r--. 1 guest guest         0 12月 25 20:04 test_1.fq.kraken.kreport
-rw-rw-r--. 1 guest guest       278 12月 25 20:34 test_2.fq.AGS.txt
-rw-rw-r--. 1 guest guest 259519913 12月 25 20:34 test_2.fq.blast.txt
-rw-rw-r--. 1 guest guest   2984383 12月 25 20:35 test_2.fq.blast.txt.filter
-rw-rw-r--. 1 guest guest   2845180 12月 25 20:18 test_2.fq.diamond.txt
-rw-rw-r--. 1 guest guest   5433091 12月 25 20:22 test_2.fq.diamond.txt.aa
-rw-rw-r--. 1 guest guest         0 12月 25 20:25 test_2.fq.kraken.kreport

v 3.5 example:

**grep 'Bacteria' *.kreport**
 39.62  490333  2169    D   2       Bacteria
  0.00  18  0   D1  2323          Bacteria incertae sedis
  0.00  18  0   D2  1783234         Bacteria candidate phyla

**cat *.AGS.txt:**
Parameters
metagenome: /backup/test/example35/WEE300_all-trimmed-decont_1.fastq
reads_sampled:  1184884
trimmed_length: 100
min_quality:    -5
mean_quality:   -5
filter_dups:    False
max_unknown:    100

Results
average_genome_size:    5504511.121548271
total_bases:    124275596
genome_equivalents: 22.57704512822287

**ll**:
-rw-rw-r--. 1 guest guest  2232536 12月 31 01:30 EsCo_genome.fasta.blast.txt
-rw-rw-r--. 1 guest guest     1573 12月 31 01:31 EsCo_genome.fasta.blast.txt.filter
-rw-rw-r--. 1 guest guest      308 12月 31 01:36 WEE300_all-trimmed-decont_1.fastq.AGS.txt
-rw-rw-r--. 1 guest guest 11395149 12月 31 01:36 WEE300_all-trimmed-decont_1.fastq.blast.txt
-rw-rw-r--. 1 guest guest   124262 12月 31 01:36 WEE300_all-trimmed-decont_1.fastq.blast.txt.filter
-rw-rw-r--. 1 guest guest   130922 12月 31 01:31 WEE300_all-trimmed-decont_1.fastq.diamond.txt
-rw-rw-r--. 1 guest guest   185100 12月 31 01:31 WEE300_all-trimmed-decont_1.fastq.diamond.txt.aa
-rw-rw-r--. 1 guest guest 85625469 12月 31 01:32 WEE300_all-trimmed-decont_1.fastq.kraken
-rw-rw-r--. 1 guest guest   605071 12月 31 01:33 WEE300_all-trimmed-decont_1.fastq.kraken.kreport
-rw-rw-r--. 1 guest guest        0 12月 31 01:38 WEE300_all-trimmed-decont_1.fastq.zip.blast.txt
-rw-rw-r--. 1 guest guest        0 12月 31 01:38 WEE300_all-trimmed-decont_1.fastq.zip.blast.txt.filter
-rw-rw-r--. 1 guest guest        0 12月 31 01:36 WEE300_all-trimmed-decont_1.fastq.zip.diamond.txt.aa

v 3.5 test:

**grep 'Bacteria' *.kreport**
 35.25  9494883 412479  D   2       Bacteria
  0.01  2587    0   D1  2323          Bacteria incertae sedis
  0.01  2587    2   D2  1783234         Bacteria candidate phyla
  0.00  213 0   D1  49928         unclassified Bacteria

**ll**:
-rw-rw-r--. 1 guest guest        199 1月   3 14:19 35.txt
-rw-rw-r--. 1 guest guest  259519913 12月 25 19:27 test_2.fq.blast.txt
-rw-rw-r--. 1 guest guest    2845180 12月 25 19:19 test_2.fq.diamond.txt
-rw-rw-r--. 1 guest guest    5433091 12月 25 19:23 test_2.fq.diamond.txt.aa
-rw-rw-r--. 1 guest guest          0 12月 25 19:26 BF0.blast.txt
-rw-rw-r--. 1 guest guest    2984383 12月 25 19:27 BF0.blast.txt.filter
-rw-rw-r--. 1 guest guest    2855730 12月 25 19:16 BF0.diamond.txt
-rw-rw-r--. 1 guest guest          0 12月 25 19:19 BF0.diamond.txt.aa
-rw-rw-r--. 1 guest guest 2789756641 12月 25 19:25 BF0.kraken
-rw-rw-r--. 1 guest guest    1117784 12月 25 19:26 BF0.kraken.kreport

v 3.6 example:

**grep 'Bacteria' *.kreport**
 39.62  490333  2169    D   2       Bacteria
  0.00  18  0   D1  2323          Bacteria incertae sedis
  0.00  18  0   D2  1783234         Bacteria candidate phyla

**cat *.AGS.txt:**
Parameters
metagenome: /backup/test/example36/WEE300_all-trimmed-decont_1.fastq
reads_sampled:  1184884
trimmed_length: 100
min_quality:    -5
mean_quality:   -5
filter_dups:    False
max_unknown:    100

Results
average_genome_size:    5504511.121548271
total_bases:    124275596
genome_equivalents: 22.57704512822287

**ll**:
-rw-rw-r--. 1 guest guest  2232536 12月 31 01:33 EsCo_genome.fasta.blast.txt
-rw-rw-r--. 1 guest guest     1573 12月 31 01:33 EsCo_genome.fasta.blast.txt.filter
-rw-rw-r--. 1 guest guest      308 12月 31 01:38 WEE300_all-trimmed-decont_1.fastq.AGS.txt
-rw-rw-r--. 1 guest guest 11395149 12月 31 01:38 WEE300_all-trimmed-decont_1.fastq.blast.txt
-rw-rw-r--. 1 guest guest   124262 12月 31 01:38 WEE300_all-trimmed-decont_1.fastq.blast.txt.filter
-rw-rw-r--. 1 guest guest   130922 12月 31 01:33 WEE300_all-trimmed-decont_1.fastq.diamond.txt
-rw-rw-r--. 1 guest guest   185100 12月 31 01:33 WEE300_all-trimmed-decont_1.fastq.diamond.txt.aa
-rw-rw-r--. 1 guest guest 85625469 12月 31 01:34 WEE300_all-trimmed-decont_1.fastq.kraken
-rw-rw-r--. 1 guest guest   605071 12月 31 01:35 WEE300_all-trimmed-decont_1.fastq.kraken.kreport
-rw-rw-r--. 1 guest guest        0 12月 31 01:40 WEE300_all-trimmed-decont_1.fastq.zip.blast.txt
-rw-rw-r--. 1 guest guest        0 12月 31 01:40 WEE300_all-trimmed-decont_1.fastq.zip.blast.txt.filter
-rw-rw-r--. 1 guest guest        0 12月 31 01:38 WEE300_all-trimmed-decont_1.fastq.zip.diamond.txt.aa

v 3.6 test:

**grep 'Bacteria' *.kreport**
 35.25  9494883 412479  D   2       Bacteria
  0.01  2587    0   D1  2323          Bacteria incertae sedis
  0.01  2587    2   D2  1783234         Bacteria candidate phyla
  0.00  213 0   D1  49928         unclassified Bacteria

**cat *.AGS.txt:**
Parameters
metagenome: /backup/test/36/test_1.fq,/backup/test/36/test_2.fq
reads_sampled:  2000000
trimmed_length: 150
min_quality:    -5
mean_quality:   -5
filter_dups:    False
max_unknown:    100

Results
average_genome_size:    3583202.382139245
total_bases:    8080855200
genome_equivalents: 2255.2047967705257

**ll**:
-rw-rw-r--. 1 guest guest  265867221 12月 30 01:19 test_1.fq.blast.txt
-rw-rw-r--. 1 guest guest    2991596 12月 30 01:20 test_1.fq.blast.txt.filter
-rw-rw-r--. 1 guest guest    2855730 12月 30 00:50 test_1.fq.diamond.txt
-rw-rw-r--. 1 guest guest    5452028 12月 30 00:58 test_1.fq.diamond.txt.aa
-rw-rw-r--. 1 guest guest  259519913 12月 30 01:20 test_2.fq.blast.txt
-rw-rw-r--. 1 guest guest    2984383 12月 30 01:20 test_2.fq.blast.txt.filter
-rw-rw-r--. 1 guest guest    2845180 12月 30 00:54 test_2.fq.diamond.txt
-rw-rw-r--. 1 guest guest    5433091 12月 30 01:02 test_2.fq.diamond.txt.aa
-rw-rw-r--. 1 guest guest        308 12月 30 01:18 BF0.AGS.txt
-rw-rw-r--. 1 guest guest    5975979 12月 30 01:20 BF0.blast.txt.filter
-rw-rw-r--. 1 guest guest 2789756641 12月 30 01:08 BF0.kraken
-rw-rw-r--. 1 guest guest    1117784 12月 30 01:09 BF0.kraken.kreport

Best regards, LeabaeL

caozhichongchong commented 8 months ago

Hi LeabaeL,

Thank you for providing more information!

Kraken seems to consider paired-end reads as the same DNA fragments and count them only once, according to its manual. test_1.fq.kraken.kreport: 7,160,704 fragments classified as Bacteria. test_2.fq.kraken.kreport: 7,140,274 fragments classified as Bacteria. Combined (test_1.fq and test_2.fq.kraken.kreport): 9,494,883 fragments classified as Bacteria.

It's interesting that Kraken paired approximately 2 million reads, leaving around 5 million unpaired. According to this discussion, kraken might output wrong results if the samples were not given last in the command line. I have updated arg_ranker v3.7.2 to account for these problems, and hopefully, it might resolve this discrepancy.

I found an ongoing issue in kraken indicates a similar unresolved problem. One idea is that we can check a few examples in tests.kraken file of paired-end reads that were recognized and not recognized as pairs. What do you think?

Best regards, Anni

LeabaeL commented 8 months ago

Hi Anni,

Here are some Kraken file results below picture:

As you can see, Kraken does not identify correct paired reads. I think this might be the reason why Kraken outputs the wrong results.

image

And v 3.7.2 test results are as follows:

image

Best regards, LeabaeL

caozhichongchong commented 8 months ago

Hi LeabaeL,

Thank you for examining the Kraken results! Hmm, that's not good. I think it might be better to run arg_ranker by treating paired-end samples separately and subsequently average the results for Total_abu. What do you think?

Best regards, Anni

LeabaeL commented 8 months ago

Hi Anni,

Thank you for your quick reply, I think this is the best solution for paired-end sequencing files at the moment.

To make sure my understanding is correct, using the results of the previous v3.7.2 answer as an example, I need to calculate the mean Totalabu of the independently processed paired-end files. $$Total{abu} = mean(Total{abu_1} + Total{abu_2})$$

And obtain the relative abundance corresponding to the risk level by multiplying the percentage of independently processed results with the Totalabu, and similarly perform the mean calculation on these results to get the final results for one sample. $$Per{RankI} = mean(Per{rankI_1} \times Total{abu_1} + Per{rankI_2} \times Total{abu_2})$$

The final rank is calculated according to the following formula: $$\frac{Rank{abu}}{max(RKN, 1)}/\frac{Total{abu}}{sum RKN}$$

Since I'm not familiar with Python, I need to make sure I'm understanding the code correctly.

Also, when I look at the command line results, I see that the ".blast.txt.filter" file is a prerequisite for generating the final result file, so I'm not sure if that's correct.

Best regards, LeabaeL

caozhichongchong commented 8 months ago

Hi LeabaeL,

Good question! Technically, PerRankI should be computed as $$Per{RankI} =(Per_{rankI1} \times Total{abu1} + Per{rankI2} \times Total{abu2})/(Total{abu1}+Total{abu2})$$ Final rank should be computed using $$\frac{RankI{abu}}{No.RankI ARGs}/\frac{Total_{abu}}{No.ARGs}$$

However, I think both Per_RankI and final ranks of two paired-end samples are almost equivalent, thus an average calculation is generally adequate.

Yes, this calculation includes data from ".blast.txt.filter".

Best regards, Anni

LeabaeL commented 7 months ago

Anni, thanks for your patience and efficient response!

caozhichongchong commented 7 months ago

Thank you for your help and support :) Good luck with your research!