Golob-Minot / geneshot

MIT License
28 stars 5 forks source link

Master Issue - v0.9 #56

Open sminot opened 3 years ago

sminot commented 3 years ago

Now that there is a critical mass of minor improvements to be made, I am creating a master issue to organize all of the changes which will go into a new minor release. I am describing this as a minor release because we will not be breaking the syntax of the inputs. The changes to be made will certainly change the content of what is being output, but that should be restricted to either (a) improved quality of existing output objects or (b) the addition of new output objects which extend the functionality of the workflow.

The most substantive change to the workflow will be the implementation of a new CAG clustering approach which uses co-assembly information to boost the performance of co-abundance clustering.

I'd like to use this issue to discuss what features and fixes will be included in the release, as well as tracking which existing issues will be subsumed by this one.

cc @jgolob

sminot commented 3 years ago

Items:

sminot commented 3 years ago

On the topic of "Aggregate results in file objects which are amenable for interactive visualization," my first thought is to save a redis store. After reading this, I will first try RDB format.

The biggest question here is going to be the tradeoff between file format, file size, and file content. I'd like to include everything needed for of one particular approach to visualization, but the file should be less than a gigabyte for a typical experiment, and should be able to be read quickly. This requirement is making me want to move away from HDF for the visualization portion, but the use of redis may need to be reassessed depending on how things go.

The smaller file size will also hopefully be supported by increasing the sensitivity of CAG clustering, which should reduce the overall number of CAGs.

The tables that I think would be needed for visualization are:

sminot commented 3 years ago

The major update to co-abundance gene clustering (in 5eb1441) implements iterative agglomerative clustering of genes, using co-assembly information to prioritize the order in which genes are added. This approach also implements a very simple filtering mechanism to limit the set of clustered genes to those which are assembled into a contig with >=X depth which contains >=Y genes in total. These parameters are --min_contig_depth and --min_contig_size

sminot commented 3 years ago

DSL-2 syntax implemented with f0b609ed7e936b26283deb88343897a21841f3c6

sminot commented 3 years ago

I did some digging into antiSMASH for metabolite prediction, and it appears that this software suite may be more suitable for running in a separate workflow, and less suitable for running within geneshot.

On a practical level, the antiSMASH software is distributed as a Docker image which includes both the code and reference database, which is far too large for the default Docker partition on many batch computing systems (including AWS Batch) and may prove to be an extremely challenging reconfiguration for many users.

On a more theoretical level, the output of the antiSMASH software is highly oriented towards the human inspection of genomic loci (see their very nicely written documentation). This is extremely interesting and useful, but it is fundamentally distinct from the gene-oriented analysis performed by geneshot. In other words, the units of analysis for antiSMASH (being the operon or genomic region) are orthogonal to anything which geneshot provides at the moment.

On another theoretical note, the reliance of antiSMASH on long assembled contigs may be confounded by the highly fragmented assemblies which result from short-read metagenomics. Further advancements in either metabolomics analysis or in long-read sequencing may change this calculus in the future.

Planning for the future, I am more inclined to incorporate metabolomics analysis into geneshot when we can use tools which output (e.g.) the predicted abundance of metabolites for a single specimen.

Do you have thoughts on the addition of antiSMASH or other tools like it into geneshot, @jgolob ?

sminot commented 3 years ago

One fairly major change that I'd like to implement relates to how we consider the association of organism abundances with the experimental design, with regards to taxonomic and functional annotations.

In the pre-v0.9 approach, the experimental design was used to estimate the association of CAG relative abundances with a user-provided formula. The connection with taxa or functions was then made with the betta approach, which considered the subset of CAGs which contained any genes with those annotations.

Instead, what would be more direct would be to analyze the taxonomic groups and the eggNOG functional annotation groups in the exact same way as the CAGs, by running corncob on the readcounts summed over the group of genes which share the same grouping.

sminot commented 3 years ago

The "interactive visualization" feature is now implemented as buildRedis

sminot commented 3 years ago

After spending time with this release, I think that the antiSMASH approach is not going to be easily integrated into the codebase, at least not for this release.