Closed jason-c-kwan closed 4 years ago
- In the
__init__
method of theMetagenome
class, a path to a taxonomy table must be supplied. During the normal course of business, this will be generated by other class methods. However, ifself.taxonomy_assigned
isTrue
, then we load an existing file into a pandas DataFrame and assign it toself.taxonomy
. The trouble is, it is unclear to me howself.taxonomy_assigned
would ever be set toTrue
when aMetagenome
object is created, so this mechanism needs to be refined.
taxonomy_assigned
is a property. This means the value is updated as corresponding values are updated. Therefore, when self.taxonomy_fpath
is written, self.taxonomy_assigned
will then evaluate to True
. We only want to read taxonomy_fpath
if the file exists, otherwise we keep this value as None
.
Would self.taxonomy
as a property resolve this?
- The
Metagenome
class has a number of methods with the@property
decorator, which means they are like attributes which are calculated each time. The trouble is that a number of them rely on theself.sequences
property. What this does is that it will parse through the assembly fasta file, and return a list of BioPython SeqIO objects. With a large assembly, this itself takes some time, andself.sequences
is run forself.nseqs
,self.mean_gc
,self.size
,self.largest_seq
,self.fragmentation_metric
(twice),self.describe
(seven times!), andlength_filter
. I think eitherself.sequences
should just load everything into memory, or perhaps if that is deemed to use too much memory, the method could populate all the dependent properties upon first running, and the dependent properties could instead be marked with@cached_property
. This is a decorator defined in thefunctools
module, which makes the method remember the return value after the first time. Connected with this issue, I am unclear on the difference betweenself.sequences
andself.seqrecords
, and the need for both methods.
@cached_property
is new in version 3.8. We could update for python 3.8. The alternative for python=3.7 has been implemented.
Connected with this issue, I am unclear on the difference between self.sequences and self.seqrecords, and the need for both methods.
self.sequences
is a faster parser that does not create SeqRecord objects for each sequence. So in terms of IO, self.sequences
is better at retrieving sequences without the need for any header information. self.seqrecords
will return a list of SeqIO.SeqRecord
objects that have their associated methods for handling these records. The former is more efficient with memory usage and IO, where the latter has more functionality for finer tuned handling of sequences.
- the Metagenome class, the orfs method checks whether self.orfs_called is True. If not, then it calls self.call_orfs. However, in the self.call_orfs method, it doesn't set self.orfs_called to True, so essentially we will have to re-run prodigal every time self.orfs is called
This does not occur. self.orfs_called
will evaluate to True if the files are written. This is property specific behavior.
- In the
get_kmers
method of theMetagenome
class, the lastreturn
line runs ifnormalize_kmers
is set toFalse
. I just think it would improve readability if that last line was in anelse
block. So instead of:Have this:
if normalized_kmers:
normalized_df = kmers.normalize(kmers_df)
normalized_df.to_csv(normalized_df, sep="\t", header=True, index=True)
return normalized_df
else:
return kmers_df
Also, note in the quoted block above there is an error - the
normalized
variable is not defined. It appears this should benormalized_df
.
normalized
is one of the function arguments. This evaluates the boolean normalized_kmers
and is the path to write the dataframe. The code-block you have provided is trying to write a dataframe to a dataframe.
if normalized_kmers:
normalized_df = kmers.normalize(kmers_df)
normalized_df.to_csv(normalized, sep="\t", header=True, index=True)
return normalized_df
else:
return kmers_df
- This method also relies on self.orfs_called (see above). Also, this method should probably assign self.taxonomy_assigned when it is done, but it doesn't
self.orfs_called
is a property that will evaluate to True once the ORFs are called. Similarly, self.taxonomy_assigned
will evaluate to True once taxonomy is assigned.
- The
write_ranks
method seems to duplicate some functionality of the MAG class, and I'm not sure under what circumstances it would be needed?
This is here in case a user just wants to write out the fastas split at a canonical rank without performing binning. Instead of splitting at the superkingdom level (the default) you can also provide more specific ranks. There have been groups that have used the pipeline for taxonomic assignment and then performed there own binning, so this gives a bit more flexibility. We can remove, move to mag.py
or keep it as is.
I guess I missed the point of properties 😄
This file contains the definition of the
Metagenome
class and accompanying methods for handling data. In reviewing it, I noticed the following issues:In the
main
function, the--cpus
argument should be specified as anint
.In the
Metagenome
docstring, it would be useful to specify that the attributesize
is the size of the assembly in bp.In the
__init__
method of theMetagenome
class, a path to a taxonomy table must be supplied. During the normal course of business, this will be generated by other class methods. However, ifself.taxonomy_assigned
isTrue
, then we load an existing file into a pandas DataFrame and assign it toself.taxonomy
. The trouble is, it is unclear to me howself.taxonomy_assigned
would ever be set toTrue
when aMetagenome
object is created, so this mechanism needs to be refined.The
Metagenome
class has a number of methods with the@property
decorator, which means they are like attributes which are calculated each time. The trouble is that a number of them rely on theself.sequences
property. What this does is that it will parse through the assembly fasta file, and return a list of BioPython SeqIO objects. With a large assembly, this itself takes some time, andself.sequences
is run forself.nseqs
,self.mean_gc
,self.size
,self.largest_seq
,self.fragmentation_metric
(twice),self.describe
(seven times!), andlength_filter
. I think eitherself.sequences
should just load everything into memory, or perhaps if that is deemed to use too much memory, the method could populate all the dependent properties upon first running, and the dependent properties could instead be marked with@cached_property
. This is a decorator defined in thefunctools
module, which makes the method remember the return value after the first time. Connected with this issue, I am unclear on the difference betweenself.sequences
andself.seqrecords
, and the need for both methods.In the
Metagenome
class, it might be better to have themean_gc
property calculate a length-weighted mean GC, because otherwise short contigs will skew the result because they invariably outnumber long contigs, and are less accurate in terms of GC. Perhaps a simple way of calculating this is just to total up the G + C (as well as the appropriate fraction of ambiguous bases), and express as a percentage of the total size.The
orfs_called
property in theMetagenome
class works out ifself.prot_orfs_fpath
andself.nucl_orfs_fpath
both exist and are non-zero size. This is not a very sophisticated way of determining if the ORFs have been called because there could have been an error half way through the prodigal run, for instance. See also thetaxonomy_assigned
property.In the
length_filter
method of theMetagenome
class, the first thing that is done is check that thecutoff
argument is afloat
orint
, but it is not done very elegantly:https://github.com/KwanLab/Autometa/blob/3e8c7718dc63e5a06654a899506535ae8bd6219e/autometa/common/metagenome.py#L266-L271
See this Stackoverflow for discussion on this: https://stackoverflow.com/questions/4187185/how-can-i-check-if-my-python-object-is-a-number
The initial casting of the variable as a
float
is a problem, because abool
will cast as afloat
, and also some strings like'inf'
and'-inf'
. The Stackoverflow gives some code that can be used to determine the variable is anint
or afloat
but NOT abool
.The
length_filter
method of theMetagenome
class returns a newMetagenome
object after writing a new fasta file with just the contigs with lengths greater than thecutoff
argument. However, the new object inherits the samenucl_orfs_fpath
andprot_orfs_fpath
as the old object. This is fine as long as ORFs have not been called beforelength_filter
, but if done the other way round then the new object will contain ORFs that are not present on the contigs in the new object (because they originate from shorter contigs). Can we guarantee that this will never happen? Because I don't think there are checks for this elsewhere (for instance, theorfs
method just parses one of the orf files stored as an attribute). I am just worried that this will be difficult to troubleshoot if it happens. At the very least, a comment should be added somewhere (perhaps so it ends up in the documentation) explaining this potential gotcha.In the
Metagenome
class, theorfs
method checks whetherself.orfs_called
isTrue
. If not, then it callsself.call_orfs
. However, in theself.call_orfs
method, it doesn't setself.orfs_called
toTrue
, so essentially we will have to re-run prodigal every timeself.orfs
is called.In the
get_kmers
method of theMetagenome
class, the lastreturn
line runs ifnormalize_kmers
is set toFalse
. I just think it would improve readability if that last line was in anelse
block. So instead of:https://github.com/KwanLab/Autometa/blob/3e8c7718dc63e5a06654a899506535ae8bd6219e/autometa/common/metagenome.py#L411-L415
Have this:
Also, note in the quoted block above there is an error - the
normalized
variable is not defined. It appears this should benormalized_df
.The
get_coverages
method in theMetagenome
class lacks a docstring.The docstring of the
assign_taxonomy
method should be updated with the arguments currently accepted through*args
and**kwargs
. This method also relies onself.orfs_called
(see above). Also, this method should probably assignself.taxonomy_assigned
when it is done, but it doesn't. This is important because this attribute is checked byget_kingdoms
, and ifFalse
that method will re-runassign_taxonomy
.In the
get_kingdoms
method, it would be useful to include what arguments are currently accepted in**kwargs
in the docstring.The
write_ranks
method seems to duplicate some functionality of theMAG
class, and I'm not sure under what circumstances it would be needed?