HRGV / phyloFlash

phyloFlash - A pipeline to rapidly reconstruct the SSU rRNAs and explore phylogenetic composition of an illumina (meta)genomic dataset.
GNU General Public License v3.0
77 stars 25 forks source link

Taxonomic rank #112

Closed AmaliT closed 2 years ago

AmaliT commented 4 years ago

Hi

Thanks for the great tool; I was wondering if there is a possibility to regenerate phyloflash html report at a varying taxonomic rank like genus or family without having to re-run phyloflash fully?

Cheers A

kbseah commented 4 years ago

Hello, thanks for your question. At the moment that's not possible for the HTML report, but you can plot and summarize the NTU counts at different taxonomic levels with the phyloFlash_compare.pl script. Use the --level to choose the taxonomic level to perform the comparison (a number between 1 and 7), and --task ntu_table option to report the counts in a table. Choose --task barplot to visualize the counts as a barplot instead. See phyloFlash_compare.pl --man for full listing of the other options. Hope that this helps!

AmaliT commented 4 years ago

Great thanks @kbseah, I will look into that. Much appreciated

AmaliT commented 4 years ago

Hi @kbseah with the phyloFlash_compare.pl could you use wild card for csv input? Also is there a limit to number of files you can input as for comparison? Also could the barplot subset take more than one parameter?

kbseah commented 4 years ago

Unfortunately you can't use wild cards to specify the input (either CSV or tar.gz archive). They have to be a comma-separated list.

You don't have to extract the CSV files manually. If you ran phyloFlash with the -everything or -almosteverything or -zip option, it will pack all the output files in a single tar.gz archive. If you have a single folder with the zip archives that you want to compare, you can then run the compare script in that folder with the -allzip option. It will parse the sample names from the filenames.

The main limitation with input is that the plots may no longer be legible if you have too many samples. For the barplots, you can adjust the plot size with the --barplot_scaleplotwidth parameter. If you have a large number of samples you can consider producing the output either as plain text NTU table or distance matrix, and then plotting it yourself, so that you can have finer control over the plotting.

When running phyloFlash for output that you want to compare later on, be sure to set the taxonomic level to something at or below the level that you want to compare later (e.g. genus, when you want to compare at family level).

Finally, there was a bug in v3.3b3 regarding how reads were counted. This has been fixed in v3.3b4 but there is some delay in updating the recipe in Bioconda. If you have installed it via Bioconda do check the version that you are using. v3.3b2 does not have this bug.

AmaliT commented 4 years ago

Thanks @kbseah very helpful; I am using v3.3.b4. Could you please advise me on regard to: "Also could the barplot subset take more than one parameter?"

kbseah commented 4 years ago

Sorry I missed that question. No it only takes one parameter.

AmaliT commented 4 years ago

Sorry one more question; whats the difference between LIBNAME.phyloFlash.NTUabundance.csv and LIBNAME.phyloFlash.NTUfull_abundance.csv?

kbseah commented 4 years ago

Both of these are produced by the main phyloFlash.pl script. LIBNAME.phyloFlash.NTUfull_abundance.csv contains the full taxonomy string from the initial mapping, whereas LIBNAME.phyloFlash.NTUabundance.csv summarizes the NTUs to the taxon level (e.g. family or class) that is requested by the -taxlevel parameter when running phyloFlash.pl. The NTUfull table is also output in case the user wants to perform the comparisons at a lower taxonomic level than the original summary.

AmaliT commented 4 years ago

great thanks @kbseah

AmaliT commented 4 years ago

Hi @kbseah, another question/potential bug regarding the taxonomy ranks and compare plots; I noticed that with the full rank - it varies from in some case it only goes up to phylum or other times species etc and if I run compare script at a certain level at times (in particular lower ranks like 6 - family level?); the plot seem to have a mix bag. Some would be at phylum level eg "streptophyta" and the other at another taxonomic rank leve (example below). So I was wondering if this is a bug as in the script I noticed the use of ';' to separate out taxa (line 826 in phyloFlash_compare.pl) and extract the level user requests - however the level that get extracted as a result might not be equivalent to the expected Linnaeus taxonomy rank. eg with this string Eukaryota;Archaeplastida;Chloroplastida;Charophyta;Phragmoplastophyta;Streptophyta; Streptophyta is the 6th element; but not family level as expected.

So I was wondering if there is a way around this; such as attaching prefix letters in the correct place like qimme output eg p_Streptophyta or something similar? Any suggestions would be appreciated.

Screen Shot 2020-05-23 at 7 23 21 PM
kbseah commented 4 years ago

Hello @AmaliT ,

Unfortunately this is a limitation of the underlying SILVA database that's used by phyloFlash. (More info: https://www.arb-silva.de/documentation/silva-taxonomy/). Eukaryote taxonomy tends to have more higher taxa above species level, so there are more intervening ranks compared to the prokaryotes. Because of the different criteria for how much sequence divergence corresponds to which taxonomic rank, it's not really realistic to compare eukaryotes and prokaryotes together.

One possibility is to run the comparison twice, once for eukaryotes and once for prokaryotes. I also see that you're working with plants, and I'm not sure whether the nuclear rRNA genes in plants are divergent enough to resolve species, for example.

Hope that this helps you somewhat.

Best regards, Brandon

AmaliT commented 4 years ago

Thanks @kbseah for the reply. I was wondering is there a way to pipe the matrix that feeds into R to create the plots as an output? would that be line 572 - 575 on compare.pl script?

AmaliT commented 4 years ago

Also one other question; is it possible to convert the output from phyloFlash to biom format?

kbseah commented 4 years ago

Hello @AmaliT ,

Thanks @kbseah for the reply. I was wondering is there a way to pipe the matrix that feeds into R to create the plots as an output? would that be line 572 - 575 on compare.pl script?

Do you mean you would like to import the table of NTU counts per sample into R and perform the plots yourself? That can be done with the option -task ntu_table with phyloFlash_compare.pl. This would produce the same table that goes into making the barplot.

Also one other question; is it possible to convert the output from phyloFlash to biom format?

Sorry that functionality hasn't been implemented. I'm not familiar with biom format. Is it something that you would recommend?

Best regards, Brandon

AmaliT commented 4 years ago

Hi @kbseah

Do you mean you would like to import the table of NTU counts per sample into R and perform the plots yourself? That can be done with the option -task ntu_table with phyloFlash_compare.pl. This would produce the same table that goes into making the barplot.

Thanks, just what I was after. Sorry I missed it in the documentation.

Also one other question; is it possible to convert the output from phyloFlash to biom format?

Sorry that functionality hasn't been implemented. I'm not familiar with biom format. Is it something that you would recommend?

[Biom format](https://biom-format.org/#:~:text=The%20BIOM%20file%20format%20(canonically,Genomics%20Standards%20Consortium%20supported%20project) is a standardised format used by some metagenomic applications; including some visualisation tools available. So I was wondering how feasible it would be to convert phyloFlash output to that - so that output can be fed into more downstream applications. Do you have any advice on any downstream applications that migh be useful to feed the phyloFlash output to explore the output further?

Thanks again for all your help

Cheers Amali

kbseah commented 4 years ago

Hello Amali,

Thanks for the info about biom format! At the moment we don't have any plans to implement it for phyloFlash, but it's something I'll keep in mind.

The main difficulty I think would be squaring the typical OTU paradigm with what phyloFlash is doing. For example, with a single marker gene you would assign each sequence to an OTU and then use the counts per OTU as a proxy for abundance. With phyloFlash it's not so straightforward because one is mapping sequences that could potentially come from any point in the SSU rRNA gene. Because the SSU rRNA gene is a mix of conserved and variable regions, reads mapping to one part may have a different taxonomic assignment than reads mapping to another part of the same gene, even though they ultimately come from the same genome. That's one reason why we don't use the term "OTU" in our output summaries, and why phyloFlash output should always be taken with a pinch of salt.

Another difficulty is inherent in the reference database we use. The SILVA project has its own taxonomy, which differs from the NCBI taxonomy in many places. Depending on what other tools you are using there could be incompatibilities in the taxonomy that you would have to manually resolve, but I guess this would be the case with any other tools that you use.

This preprint might be useful if you're interested in quantifying microbes from sequencing data (amplicon or metagenome) https://doi.org/10.1101/2020.05.19.103937

Best regards, Brandon

AmaliT commented 4 years ago

Hi Brandon

Thanks heaps explaining it through and for the reference. Much appreciated

One other aspect I was wondering was could I use the rRNA.fasta file generated from phyloFlash to run amplicon pipelines? Thoughts?

Kind regards Amali

kbseah commented 4 years ago

Hi Amali,

One other aspect I was wondering was could I use the rRNA.fasta file generated from phyloFlash to run amplicon pipelines? Thoughts?

I wouldn't recommend it, because only a fraction of the total rRNA reads in the library will end up being assembled into full-length sequences. If an organism's genome has insufficient coverage, then it won't be represented in the assembled sequences, even if it is detected by mapping. Paradoxically if coverage is too high, some sequences may not assemble either, which may be caused by heterozygosity or maybe multiple related paralogs.

Also, amplicon methods are good at picking up low-abundance taxa, whereas the phyloFlash approach is targeted at the most abundant members, because only a small fraction of total metagenomic reads derive from rRNA genes.

@HRGV 's original motivation for the assembly step was to get full-length sequences for phylogenetics. Short amplicons tend to be ok for classifying known organisms but not for identifying novel divergent taxa. Hope that this helps!

Best regards, Brandon

AmaliT commented 4 years ago

Thanks @kbseah for all the advice and reply. Much appreciated.

kbseah commented 4 years ago

@HRGV i think we can close this issue as it appears to have been resolved