DerrickWood / kraken

Kraken taxonomic sequence classification system
http://ccb.jhu.edu/software/kraken/
GNU General Public License v3.0

Functionality to join two kraken output files together #130

Open tayabsoomro opened 6 years ago

tayabsoomro commented 6 years ago

Hi, is there any functionality for joining two Kraken output files together?

Thanks, Tayab Soomro.

jenniferlu717 commented 6 years ago

What kind of outputs are you trying to combine? Are they from a single sample, and you want a report for the sum of the two? Or are you trying to compare them?

tayabsoomro commented 6 years ago

I have two kraken-style report files generated by two separate DNA classification runs. Each report shows the proportion of reads assigned to each taxon in the sample. An example is shown below:

  6.00  600   600   U  0          unclassified
 94.00  9400  0     -  1          root
 94.00  9400  0     -  131567       cellular organisms
 94.00  9400  0     D  2              Bacteria
 94.00  9400  0     -  1783272          Terrabacteria group
 94.00  9400  0     P  1239               Firmicutes
 94.00  9400  0     C  91061                Bacilli
 94.00  9400  0     O  1385                   Bacillales
 94.00  9400  0     F  186817                   Bacillaceae
 94.00  9400  0     G  1386                       Bacillus
 94.00  9400  0     S  86661                        Bacillus cereus group
 94.00  9400  3463  S  1392                           Bacillus anthracis
 58.71  5871  5871  S  198094200                        B.anthracis Ames
  0.66  66    66    S  191218100                        B.anthracis A2012

Now imagine that the second kraken-style report contains some of the same species and some different ones. I would like to merge the two reports into a single final kraken-style report.

For example, if B. anthracis Ames also appears in the second report, the final report should list it only once, with its read count and proportion updated to reflect both reports. If the second report contains another strain under Bacillus anthracis that is absent from the first report, the final report should add that strain under Bacillus anthracis and update the proportions accordingly.
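The merge described above can be sketched in Python roughly as follows. This is only an illustration, not an existing Kraken feature: the function names (`parse_report`, `merge_reports`, `format_report`) are made up for this example, and it assumes tab-separated reports with the six columns shown earlier and two spaces of name indentation per taxonomy level. It sums the directly assigned reads (third column) per taxon ID, then recomputes clade counts and percentages from the merged tree.

```python
def parse_report(lines):
    """Yield (depth, taxid, rank, name, direct_reads) for each report line.

    Assumes tab-separated columns: percent, clade reads, direct reads,
    rank code, taxon ID, and a name indented two spaces per level.
    """
    for line in lines:
        if not line.strip():
            continue
        _pct, _clade, direct, rank, taxid, name = line.rstrip("\n").split("\t")
        stripped = name.lstrip(" ")
        depth = (len(name) - len(stripped)) // 2
        yield depth, taxid.strip(), rank.strip(), stripped, int(direct)


def merge_reports(*reports):
    """Merge reports (iterables of lines) into a dict keyed by taxon ID."""
    nodes = {}
    for lines in reports:
        stack = []  # taxon IDs of the ancestors of the current line
        for depth, taxid, rank, name, direct in parse_report(lines):
            del stack[depth:]  # drop entries that are not ancestors
            parent = stack[-1] if stack else None
            if taxid not in nodes:
                # The first report that mentions a taxon decides its parent.
                nodes[taxid] = {"name": name, "rank": rank, "direct": 0,
                                "parent": parent, "children": []}
                if parent is not None:
                    nodes[parent]["children"].append(taxid)
            nodes[taxid]["direct"] += direct
            stack.append(taxid)
    return nodes


def clade_total(nodes, taxid):
    """Reads assigned to a taxon plus all of its descendants."""
    node = nodes[taxid]
    return node["direct"] + sum(clade_total(nodes, c) for c in node["children"])


def format_report(nodes):
    """Return kraken-style report lines for the merged tree."""
    roots = [t for t, n in nodes.items() if n["parent"] is None]
    total = sum(clade_total(nodes, t) for t in roots)
    out = []

    def walk(taxid, depth):
        node = nodes[taxid]
        clade = clade_total(nodes, taxid)
        pct = 100.0 * clade / total if total else 0.0
        out.append(f"{pct:6.2f}\t{clade}\t{node['direct']}\t{node['rank']}"
                   f"\t{taxid}\t{'  ' * depth}{node['name']}")
        for child in node["children"]:
            walk(child, depth + 1)

    for taxid in roots:
        walk(taxid, 0)
    return out
```

Calling `format_report(merge_reports(open("a.report"), open("b.report")))` (file names hypothetical) would give the merged lines. Children are kept in first-seen order rather than sorted by read count, so the line order may differ from what Kraken itself would print.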

jenniferlu717 commented 5 years ago

@tayabsoomro I know this reply is very late, but we are working on a set of "Kraken-Tools" that will provide additional support for projects like this.

tayabsoomro commented 5 years ago

That is good to hear! I ended up creating such a tool myself, but it would be great if this were added to Kraken. Thanks.

susheelbhanu commented 5 years ago

Hey @tayabsoomro. I'm interested in doing something similar, so could you please share this tool you're referring to?

Thank you!

tayabsoomro commented 5 years ago

I ended up using the Centrifuge tool and its centrifuge-kreport command to generate the kraken-style report.

I combined the multiple Centrifuge reports by appending them to a single file in Python, and once I had the accumulated Centrifuge report file, I generated the kraken-style report from it.

Here is the snippet of code that I wrote; I hope it helps:

https://github.com/coadunate/MICAS/blob/24db33140419219320ebf6d230e4519894f1bc2d/server/app/main/utils/FASTQFileHandler.py#L58-L83
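For readers who do not want to follow the link, the append step described above might look roughly like the sketch below. It is not the linked code. The function name `concatenate_outputs` is made up for this example, and it assumes each Centrifuge output file is tab-separated text that starts with a single header line, which should appear only once in the combined file.

```python
def concatenate_outputs(paths, combined_path):
    """Append several Centrifuge output files into one, keeping one header.

    Assumes every input file begins with the same one-line column header.
    """
    header_written = False
    with open(combined_path, "w") as out:
        for path in paths:
            with open(path) as src:
                header = src.readline()
                if header and not header_written:
                    # Keep the header from the first non-empty file only.
                    out.write(header if header.endswith("\n") else header + "\n")
                    header_written = True
                for line in src:
                    # Guard against a final line with no trailing newline.
                    out.write(line if line.endswith("\n") else line + "\n")
```

The combined file can then be given to centrifuge-kreport to produce a single kraken-style report.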