KwanLab / Autometa

Autometa: Automated Extraction of Genomes from Shotgun Metagenomes
https://autometa.readthedocs.io
Other
40 stars 15 forks source link

[dev branch] Issues with autometa/common/mag.py #53

Closed jason-c-kwan closed 4 years ago

jason-c-kwan commented 4 years ago

This file contains the definition of the MAG class, which is basically a subset of the contigs in a metagenome. The issues that I came across are outlined below:

  1. [x] The name of the script and the class is perhaps not completely logical. Most of the time a MAG class represents a kingdom bin and not a metagenome-assembled-genome (MAG). There is also a problem with the name "bin", because this is already a function in Python that converts an integer to a binary string. Perhaps something like meta_bin would be more appropriate?

  2. [x] The MAG class itself does not have a docstring.

  3. [x] There are two attributes in the MAG class that are kind of similar - self.assembly_name, which is defined as self.basename.split('.')[0] and self.root which is defined as os.path.splitext(self.basename)[0]. It looks like self.root is used for self.nucls_fname and self.prots_fname, but is self.assembly_name used or needed?

  4. [x] A number of the methods in the MAG class have the @property decorator, and so they are sort of like attributes but are computed when they are called upon. The problem here is that a number of them call self.get_seqs, which returns the BioPython SeqIO records for the MAG. This means that when totalsize, size_pct, nallseqs, seqs_pct, size, or gc_content are called, the fasta file the MAG is based on will be parsed each time. I would suggest the following - perhaps the get_seqs method could populate all the other values when it is first run, and this other things like totalsize could be converted from a @property to a @cached_property as defined in the functools module. A @cached_property will remember the value after it is first run to save on re-computing time.

  5. [x] The gc_content property in the MAG class uses a simple np.mean statement. I think it would be more appropriate to calculate a length-weighted mean of GC, because otherwise the GC value will tend to be skewed by short contigs, especially if the bin is made up of a lot of short contigs. The thing I am worried about here is that the GC content of short contigs is less likely to be completely representative of their parent genomes.

  6. [x] Many of the methods and properties of the MAG class do not have docstrings, which is important now that we are using docstrings directly for the documentation.

  7. [ ] The prepared method seems to check whether the path given in the argument fpath exists and is non-zero size. I don't think this is a great way to determine if a step finished because the file could be there, be non-zero size, but still be corrupted.

  8. [x] The get_orfs method retrieves ORFs corresponding to the MAG, and one of the arguments is orf_type='prot'. Near the beginning of the method, the values of self.prot_orfs_fpath and self.nucl_orfs_fpath are put into a dictionary called orfs_fpaths, keyed under 'prot' and 'nucl', the two accepted values of orf_type. Then the appropriate path is picked using a dict.get statement:

https://github.com/KwanLab/Autometa/blob/3e8c7718dc63e5a06654a899506535ae8bd6219e/autometa/common/mag.py#L134-L135

The problem I see here is that if the orf_type argument is not 'prot' or 'nucl', then the method silently just assumes 'prot', and I just think it would be better for bug fixing to throw an error instead.

  1. [x] The split_nucleotides method is not yet implemented, and there is no docstring describing what it is supposed to do, so I can't tell if it should be implemented right now or not.

  2. [x] In the get_binning method, it might be useful to outline in the docstring what parameters are currently accepted/expected in the **kwargs argument.

  3. [x] In the get_binning method, right now the embedding method that is requested is 'UMAP' rather than 'bhtsne'. Also, I think that the method arguments for kmers.embed are now lowercase and so this method will crash. My point here is that if we are trying to emulate Autometa 1.0 functionality, the default should be 'bhtsne'.

  4. [x] The get_binning method appears to expect a pandas DataFrame for the 'coverage' argument in **kwargs, but expects a path for the 'taxonomy' argument, which it then loads into a DataFrame. It is not entirely clear why there is this difference in data types expected.

evanroyrees commented 4 years ago
  1. The name of the script and the class is perhaps not completely logical. Most of the time a MAG class represents a kingdom bin and not a metagenome-assembled-genome (MAG). There is also a problem with the name "bin", because this is already a function in Python that converts an integer to a binary string. Perhaps something like meta_bin would be more appropriate?

How about metabin.py and MetaBin for the class?

evanroyrees commented 4 years ago
  1. There are two attributes in the MAG class that are kind of similar - self.assembly_name, which is defined as self.basename.split('.')[0] and self.root which is defined as os.path.splitext(self.basename)[0]. It looks like self.root is used for self.nucls_fname and self.prots_fname, but is self.assembly_name used or needed?

self.assembly_name has now been removed.

evanroyrees commented 4 years ago
  1. A number of the methods in the MAG class have the @property decorator, and so they are sort of like attributes but are computed when they are called upon. The problem here is that a number of them call self.get_seqs, which returns the BioPython SeqIO records for the MAG. This means that when totalsize, size_pct, nallseqs, seqs_pct, size, or gc_content are called, the fasta file the MAG is based on will be parsed each time. I would suggest the following - perhaps the get_seqs method could populate all the other values when it is first run, and this other things like totalsize could be converted from a @property to a @cached_property as defined in the functools module. A @cached_property will remember the value after it is first run to save on re-computing time.

  2. The gc_content property in the MAG class uses a simple np.mean statement. I think it would be more appropriate to calculate a length-weighted mean of GC, because otherwise the GC value will tend to be skewed by short contigs, especially if the bin is made up of a lot of short contigs. The thing I am worried about here is that the GC content of short contigs is less likely to be completely representative of their parent genomes.

Time consuming properties have been cached using python3.7 implementation. GC is now weighted by length. Example of caching is below with the GC implementation:

import os

import numpy as np

from functools import lru_cache
from Bio import SeqIO
from Bio.SeqUtils import GC

class MiniMetaBin:
    def __init__(self, assembly, contigs, outdir=None):
        self.assembly = os.path.realpath(assembly)
        self.contigs = contigs
    @property
    @lru_cache(maxsize=None)
    def seqrecords(self):
        """Retrieve seqrecords from assembly contained in `self.contigs`.

        Returns
        -------
        list
            list of SeqIO [SeqRecords, ...]

        """
        return [seq for seq in SeqIO.parse(self.assembly, 'fasta')
            if seq.id in self.contigs]

    @property
    @lru_cache(maxsize=None)
    def length_weighted_gc(self):
        weights = [len(rec.seq)/self.size for rec in self.seqrecords]
        gc_content = [SeqUtils.GC(rec.seq) for rec in self.seqrecords]
        return np.average(a=gc_content, weights=weights)
evanroyrees commented 4 years ago
  1. The prepared method seems to check whether the path given in the argument fpath exists and is non-zero size. I don't think this is a great way to determine if a step finished because the file could be there, be non-zero size, but still be corrupted.

I've added a # COMBAK comment, as this functionality should be imported from utilities. This is something that is prevalent across the pipeline. To avoid redundancy we should implement a generalizable function or decorator and import.

evanroyrees commented 4 years ago
  1. The split_nucleotides method is not yet implemented, and there is no docstring describing what it is supposed to do, so I can't tell if it should be implemented right now or not.

I have removed this to avoid confusion

evanroyrees commented 4 years ago
  1. The get_binning method appears to expect a pandas DataFrame for the 'coverage' argument in **kwargs, but expects a path for the 'taxonomy' argument, which it then loads into a DataFrame. It is not entirely clear why there is this difference in data types expected.

The coverage file path is now used in the method