Closed jason-c-kwan closed 4 years ago
- The name of the script and the class is perhaps not completely logical. Most of the time a MAG class represents a kingdom bin and not a metagenome-assembled-genome (MAG). There is also a problem with the name "bin", because this is already a function in Python that converts an integer to a binary string. Perhaps something like meta_bin would be more appropriate?
How about metabin.py
and MetaBin
for the class?
- There are two attributes in the MAG class that are kind of similar - self.assembly_name, which is defined as self.basename.split('.')[0] and self.root which is defined as os.path.splitext(self.basename)[0]. It looks like self.root is used for self.nucls_fname and self.prots_fname, but is self.assembly_name used or needed?
self.assembly_name
has now been removed.
A number of the methods in the MAG class have the
@property
decorator, and so they are sort of like attributes but are computed when they are called upon. The problem here is that a number of them call self.get_seqs, which returns the BioPython SeqIO records for the MAG. This means that when totalsize, size_pct, nallseqs, seqs_pct, size, or gc_content are called, the fasta file the MAG is based on will be parsed each time. I would suggest the following - perhaps the get_seqs method could populate all the other values when it is first run, and this other things like totalsize could be converted from a@property
to a @cached_property as defined in the functools module. A @cached_property will remember the value after it is first run to save on re-computing time.The gc_content property in the MAG class uses a simple np.mean statement. I think it would be more appropriate to calculate a length-weighted mean of GC, because otherwise the GC value will tend to be skewed by short contigs, especially if the bin is made up of a lot of short contigs. The thing I am worried about here is that the GC content of short contigs is less likely to be completely representative of their parent genomes.
Time consuming properties have been cached using python3.7 implementation. GC is now weighted by length. Example of caching is below with the GC implementation:
import os
import numpy as np
from functools import lru_cache
from Bio import SeqIO
from Bio.SeqUtils import GC
class MiniMetaBin:
def __init__(self, assembly, contigs, outdir=None):
self.assembly = os.path.realpath(assembly)
self.contigs = contigs
@property
@lru_cache(maxsize=None)
def seqrecords(self):
"""Retrieve seqrecords from assembly contained in `self.contigs`.
Returns
-------
list
list of SeqIO [SeqRecords, ...]
"""
return [seq for seq in SeqIO.parse(self.assembly, 'fasta')
if seq.id in self.contigs]
@property
@lru_cache(maxsize=None)
def length_weighted_gc(self):
weights = [len(rec.seq)/self.size for rec in self.seqrecords]
gc_content = [SeqUtils.GC(rec.seq) for rec in self.seqrecords]
return np.average(a=gc_content, weights=weights)
The prepared method seems to check whether the path given in the argument fpath exists and is non-zero size. I don't think this is a great way to determine if a step finished because the file could be there, be non-zero size, but still be corrupted.
I've added a # COMBAK
comment, as this functionality should be imported from utilities. This is something that is prevalent across the pipeline. To avoid redundancy we should implement a generalizable function or decorator and import.
- The split_nucleotides method is not yet implemented, and there is no docstring describing what it is supposed to do, so I can't tell if it should be implemented right now or not.
I have removed this to avoid confusion
- The get_binning method appears to expect a pandas DataFrame for the 'coverage' argument in **kwargs, but expects a path for the 'taxonomy' argument, which it then loads into a DataFrame. It is not entirely clear why there is this difference in data types expected.
The coverage file path is now used in the method
This file contains the definition of the MAG class, which is basically a subset of the contigs in a metagenome. The issues that I came across are outlined below:
[x] The name of the script and the class is perhaps not completely logical. Most of the time a MAG class represents a kingdom bin and not a metagenome-assembled-genome (MAG). There is also a problem with the name "bin", because this is already a function in Python that converts an integer to a binary string. Perhaps something like
meta_bin
would be more appropriate?[x] The MAG class itself does not have a docstring.
[x] There are two attributes in the MAG class that are kind of similar -
self.assembly_name
, which is defined asself.basename.split('.')[0]
andself.root
which is defined asos.path.splitext(self.basename)[0]
. It looks likeself.root
is used forself.nucls_fname
andself.prots_fname
, but isself.assembly_name
used or needed?[x] A number of the methods in the MAG class have the
@property
decorator, and so they are sort of like attributes but are computed when they are called upon. The problem here is that a number of them callself.get_seqs
, which returns the BioPython SeqIO records for the MAG. This means that whentotalsize
,size_pct
,nallseqs
,seqs_pct
,size
, orgc_content
are called, the fasta file the MAG is based on will be parsed each time. I would suggest the following - perhaps theget_seqs
method could populate all the other values when it is first run, and this other things liketotalsize
could be converted from a@property
to a@cached_property
as defined in the functools module. A@cached_property
will remember the value after it is first run to save on re-computing time.[x] The
gc_content
property in the MAG class uses a simplenp.mean
statement. I think it would be more appropriate to calculate a length-weighted mean of GC, because otherwise the GC value will tend to be skewed by short contigs, especially if the bin is made up of a lot of short contigs. The thing I am worried about here is that the GC content of short contigs is less likely to be completely representative of their parent genomes.[x] Many of the methods and properties of the MAG class do not have docstrings, which is important now that we are using docstrings directly for the documentation.
[ ] The
prepared
method seems to check whether the path given in the argumentfpath
exists and is non-zero size. I don't think this is a great way to determine if a step finished because the file could be there, be non-zero size, but still be corrupted.[x] The
get_orfs
method retrieves ORFs corresponding to the MAG, and one of the arguments isorf_type='prot'
. Near the beginning of the method, the values ofself.prot_orfs_fpath
andself.nucl_orfs_fpath
are put into a dictionary calledorfs_fpaths
, keyed under'prot'
and'nucl'
, the two accepted values oforf_type
. Then the appropriate path is picked using adict.get
statement:https://github.com/KwanLab/Autometa/blob/3e8c7718dc63e5a06654a899506535ae8bd6219e/autometa/common/mag.py#L134-L135
The problem I see here is that if the
orf_type
argument is not'prot'
or'nucl'
, then the method silently just assumes'prot'
, and I just think it would be better for bug fixing to throw an error instead.[x] The
split_nucleotides
method is not yet implemented, and there is no docstring describing what it is supposed to do, so I can't tell if it should be implemented right now or not.[x] In the
get_binning
method, it might be useful to outline in the docstring what parameters are currently accepted/expected in the**kwargs
argument.[x] In the
get_binning
method, right now the embedding method that is requested is'UMAP'
rather than'bhtsne'
. Also, I think that the method arguments forkmers.embed
are now lowercase and so this method will crash. My point here is that if we are trying to emulate Autometa 1.0 functionality, the default should be'bhtsne'
.[x] The
get_binning
method appears to expect a pandas DataFrame for the'coverage'
argument in**kwargs
, but expects a path for the'taxonomy'
argument, which it then loads into a DataFrame. It is not entirely clear why there is this difference in data types expected.