Master Issue - v0.9

sminot commented 3 years ago

Now that there is a critical mass of minor improvements to be made, I am creating a master issue to organize all of the changes which will go into a new minor release. I am describing this as a minor release because we will not be breaking the syntax of the inputs. The changes to be made will certainly change the content of what is being output, but that should be restricted to either (a) improved quality of existing output objects or (b) the addition of new output objects which extend the functionality of the workflow.

The most substantive change to the workflow will be the implementation of a new CAG clustering approach which uses co-assembly information to boost the performance of co-abundance clustering.

I'd like to use this issue to discuss what features and fixes will be included in the release, as well as tracking which existing issues will be subsumed by this one.

cc @jgolob

sminot commented 3 years ago

Items:

[x] Enhance the performance of co-abundance clustering using co-assembly information
[x] Add extractCounts to the default workflow even when corncob is not run (#53)
[x] Add rRNA prediction on contigs (#42)
[ ] Check for empty inputs (#28)
[x] Add antiSMASH for pathway identification (#26)
[x] Aggregate results in file objects which are amenable for interactive visualization
[x] Add E-value threshold for taxonomic assignment
[x] Add E-value threshold for eggNOG functional classification assignment
[x] Update DSL-2 syntax for NXF_VER=20.10.0

sminot commented 3 years ago

On the topic of "Aggregate results in file objects which are amenable for interactive visualization," my first thought is to save a redis store. After reading this, I will first try RDB format.

The biggest question here is going to be the tradeoff between file format, file size, and file content. I'd like to include everything needed for one particular approach to visualization, but the file should be less than a gigabyte for a typical experiment, and it should be quick to read. This requirement is making me want to move away from HDF for the visualization portion, but the use of redis may need to be reassessed depending on how things go.

The smaller file size will also hopefully be supported by increasing the sensitivity of CAG clustering, which should reduce the overall number of CAGs.

The tables that I think would be needed for visualization are:

(per CAG) Size (number of genes)
(per CAG) Majority taxonomic assignment of the genes in each CAG (for labeling purposes only) at each taxonomic rank
(per CAG) Estimated coefficients of association (and p-values) for each covariate
(per CAG) Number of genes assigned at each taxonomic rank
(per CAG) Number of genes assigned to each unique function (via eggNOG)
(per CAG) Relative proportion of gene copies which are assigned to this CAG across all specimens
(per CAG) Ordination layout for plotting on the basis of taxonomic classification spectra
(per dataset) User-provided metadata sheet (with R1/R2 removed, and all specimens deduplicated)
(per taxon) Set of CAGs which contain genes assigned to that taxon
(per taxon) Relative proportion of gene copies which are assigned to this taxon across all specimens
(per function) Set of CAGs which contain genes assigned to this eggNOG function
(per function) Relative proportion of gene copies which are assigned to this function across all specimens
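As a rough illustration of the first two per-CAG tables above (size and majority taxonomic label), here is a minimal sketch. The function name `summarize_cags`, the dictionary inputs, and the example labels are all hypothetical and are not taken from the geneshot codebase. It handles a single taxonomic rank and treats "majority" as the most common label among the annotated genes.

```python
from collections import Counter, defaultdict


def summarize_cags(gene_to_cag, gene_to_taxon):
    """Return {cag: {"size": int, "majority_taxon": str or None}}.

    gene_to_cag:   hypothetical mapping of gene ID -> CAG ID
    gene_to_taxon: hypothetical mapping of gene ID -> taxon label at one rank
                   (genes without an assignment are simply absent)
    """
    # Collect the genes that belong to each CAG
    genes_per_cag = defaultdict(list)
    for gene, cag in gene_to_cag.items():
        genes_per_cag[cag].append(gene)

    summary = {}
    for cag, genes in genes_per_cag.items():
        # Count taxon labels among the annotated genes only
        labels = Counter(gene_to_taxon[g] for g in genes if g in gene_to_taxon)
        # Most common label, or None when no gene in the CAG is annotated
        majority = labels.most_common(1)[0][0] if labels else None
        summary[cag] = {"size": len(genes), "majority_taxon": majority}
    return summary


gene_to_cag = {"g1": "CAG1", "g2": "CAG1", "g3": "CAG1", "g4": "CAG2"}
gene_to_taxon = {"g1": "Bacteroides", "g2": "Bacteroides", "g3": "Prevotella"}

summary = summarize_cags(gene_to_cag, gene_to_taxon)
print(summary["CAG1"])  # {'size': 3, 'majority_taxon': 'Bacteroides'}
print(summary["CAG2"])  # {'size': 1, 'majority_taxon': None}
```

Repeating this per rank would give the "at each taxonomic rank" version; how ties should be broken is left open here.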

sminot commented 3 years ago

The major update to co-abundance gene clustering (in 5eb1441) implements iterative agglomerative clustering of genes, using co-assembly information to prioritize the order in which genes are added. This approach also implements a very simple filtering mechanism to limit the set of clustered genes to those which are assembled into a contig with >=X depth which contains >=Y genes in total. These thresholds are set with --min_contig_depth (X) and --min_contig_size (Y).
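The filtering step can be sketched as follows. This is only an illustration under simplifying assumptions, not the code in the commit: the function name `filter_genes_by_contig` and the input dictionaries are hypothetical, and each gene is assumed to map to a single contig.

```python
from collections import defaultdict


def filter_genes_by_contig(gene_to_contig, contig_depth,
                           min_contig_depth, min_contig_size):
    """Return the set of genes located on contigs passing both thresholds.

    gene_to_contig: hypothetical mapping of gene ID -> contig ID
    contig_depth:   hypothetical mapping of contig ID -> sequencing depth
    """
    # Group genes by the contig they were assembled into
    genes_per_contig = defaultdict(set)
    for gene, contig in gene_to_contig.items():
        genes_per_contig[contig].add(gene)

    kept = set()
    for contig, genes in genes_per_contig.items():
        # Contigs with no recorded depth are treated as depth 0
        deep_enough = contig_depth.get(contig, 0.0) >= min_contig_depth
        big_enough = len(genes) >= min_contig_size
        if deep_enough and big_enough:
            kept |= genes
    return kept


gene_to_contig = {"g1": "c1", "g2": "c1", "g3": "c1",
                  "g4": "c2", "g5": "c3", "g6": "c3"}
contig_depth = {"c1": 12.0, "c2": 30.0, "c3": 2.5}

kept = filter_genes_by_contig(gene_to_contig, contig_depth,
                              min_contig_depth=5, min_contig_size=2)
print(sorted(kept))  # ['g1', 'g2', 'g3']
```

In this toy input, contig c2 is deep but carries only one gene, and c3 carries two genes but falls below the depth cutoff, so only the genes on c1 are kept for clustering.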

sminot commented 3 years ago

DSL-2 syntax implemented with f0b609ed7e936b26283deb88343897a21841f3c6

sminot commented 3 years ago

I did some digging into antiSMASH for metabolite prediction, and it appears that this software suite may be more suitable for running in a separate workflow, and less suitable for running within geneshot.

On a practical level, the antiSMASH software is distributed as a Docker image which includes both the code and reference database, which is far too large for the default Docker partition on many batch computing systems (including AWS Batch) and may prove to be an extremely challenging reconfiguration for many users.

On a more theoretical level, the output of the antiSMASH software is highly oriented towards the human inspection of genomic loci (see their very nicely written documentation). This is extremely interesting and useful, but it is fundamentally distinct from the gene-oriented analysis performed by geneshot. In other words, the units of analysis for antiSMASH (being the operon or genomic region) are orthogonal to anything which geneshot provides at the moment.

On another theoretical note, the reliance of antiSMASH on long assembled contigs may be confounded by the highly fragmented assemblies which result from short-read metagenomics. Further advancements in either metabolomics analysis or in long-read sequencing may change this calculus in the future.

Planning for the future, I am more inclined to incorporate metabolomics analysis into geneshot when we can use tools which output (e.g.) the predicted abundance of metabolites for a single specimen.

Do you have thoughts on the addition of antiSMASH or other tools like it into geneshot, @jgolob ?

sminot commented 3 years ago

One fairly major change that I'd like to implement relates to how we consider the association of organism abundances with the experimental design, with regard to taxonomic and functional annotations.

In the pre-v0.9 approach, the experimental design was used to estimate the association of CAG relative abundances with a user-provided formula. The connection with taxa or functions was then made with the betta approach, which considered the subset of CAGs which contained any genes with those annotations.

Instead, a more direct approach would be to analyze the taxonomic groups and the eggNOG functional annotation groups in exactly the same way as the CAGs, by running corncob on the read counts summed over each group of genes which share the same annotation.
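To make the summing step concrete, here is a minimal sketch of collapsing per-gene read counts into per-group read counts. The function name `sum_counts_by_group` and the input layout are hypothetical. The resulting table is what would then be handed to corncob in place of the per-CAG counts; the corncob call itself is not shown.

```python
from collections import defaultdict


def sum_counts_by_group(gene_counts, gene_groups):
    """Sum per-specimen read counts over all genes sharing a group label.

    gene_counts: hypothetical mapping of gene ID -> {specimen: read count}
    gene_groups: hypothetical mapping of gene ID -> list of group labels
                 (taxa or eggNOG functions); unannotated genes are absent
    """
    grouped = defaultdict(lambda: defaultdict(int))
    for gene, per_specimen in gene_counts.items():
        # A gene contributes to every group it is annotated with
        for group in gene_groups.get(gene, ()):
            for specimen, n in per_specimen.items():
                grouped[group][specimen] += n
    # Convert nested defaultdicts to plain dicts for the caller
    return {group: dict(counts) for group, counts in grouped.items()}


gene_counts = {
    "g1": {"s1": 10, "s2": 0},
    "g2": {"s1": 5, "s2": 7},
    "g3": {"s1": 1, "s2": 2},
}
gene_groups = {"g1": ["taxon_A"], "g2": ["taxon_A"], "g3": ["taxon_B"]}

group_counts = sum_counts_by_group(gene_counts, gene_groups)
print(group_counts["taxon_A"])  # {'s1': 15, 's2': 7}
```

The same function covers both taxa and eggNOG functions, since only the grouping labels differ.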

sminot commented 3 years ago

The "interactive visualization" feature is now implemented as buildRedis.

sminot commented 3 years ago

After spending time with this release, I think that the antiSMASH approach is not going to be easily integrated into the codebase, at least not for this release.

Golob-Minot / geneshot

Master Issue - v0.9 #56