[dev branch] Issues with autometa/common/metagenome.py

jason-c-kwan commented 4 years ago

This file contains the definition of the Metagenome class and accompanying methods for handling data. In reviewing it, I noticed the following issues:

In the main function, the --cpus argument should be specified as an int.
In the Metagenome docstring, it would be useful to specify that the attribute size is the size of the assembly in bp.
In the __init__ method of the Metagenome class, a path to a taxonomy table must be supplied. During the normal course of business, this will be generated by other class methods. However, if self.taxonomy_assigned is True, then we load an existing file into a pandas DataFrame and assign it to self.taxonomy. The trouble is, it is unclear to me how self.taxonomy_assigned would ever be set to True when a Metagenome object is created, so this mechanism needs to be refined.
The Metagenome class has a number of methods with the @property decorator, which means they are like attributes which are calculated each time. The trouble is that a number of them rely on the self.sequences property. What this does is that it will parse through the assembly fasta file, and return a list of BioPython SeqIO objects. With a large assembly, this itself takes some time, and self.sequences is run for self.nseqs, self.mean_gc, self.size, self.largest_seq, self.fragmentation_metric (twice), self.describe (seven times!), and length_filter. I think either self.sequences should just load everything into memory, or perhaps if that is deemed to use too much memory, the method could populate all the dependent properties upon first running, and the dependent properties could instead be marked with @cached_property. This is a decorator defined in the functools module, which makes the method remember the return value after the first time. Connected with this issue, I am unclear on the difference between self.sequences and self.seqrecords, and the need for both methods.
In the Metagenome class, it might be better to have the mean_gc property calculate a length-weighted mean GC, because otherwise short contigs will skew the result because they invariably outnumber long contigs, and are less accurate in terms of GC. Perhaps a simple way of calculating this is just to total up the G + C (as well as the appropriate fraction of ambiguous bases), and express as a percentage of the total size.
The orfs_called property in the Metagenome class works out if self.prot_orfs_fpath and self.nucl_orfs_fpath both exist and are non-zero size. This is not a very sophisticated way of determining if the ORFs have been called because there could have been an error half way through the prodigal run, for instance. See also the taxonomy_assigned property.
In the length_filter method of the Metagenome class, the first thing that is done is check that the cutoff argument is a float or int, but it is not done very elegantly:

https://github.com/KwanLab/Autometa/blob/3e8c7718dc63e5a06654a899506535ae8bd6219e/autometa/common/metagenome.py#L266-L271

See this Stackoverflow for discussion on this: https://stackoverflow.com/questions/4187185/how-can-i-check-if-my-python-object-is-a-number

The initial casting of the variable as a float is a problem, because a bool will cast as a float, and also some strings like 'inf' and '-inf'. The Stackoverflow gives some code that can be used to determine the variable is an int or a float but NOT a bool.

The length_filter method of the Metagenome class returns a new Metagenome object after writing a new fasta file with just the contigs with lengths greater than the cutoff argument. However, the new object inherits the same nucl_orfs_fpath and prot_orfs_fpath as the old object. This is fine as long as ORFs have not been called before length_filter, but if done the other way round then the new object will contain ORFs that are not present on the contigs in the new object (because they originate from shorter contigs). Can we guarantee that this will never happen? Because I don't think there are checks for this elsewhere (for instance, the orfs method just parses one of the orf files stored as an attribute). I am just worried that this will be difficult to troubleshoot if it happens. At the very least, a comment should be added somewhere (perhaps so it ends up in the documentation) explaining this potential gotcha.
In the Metagenome class, the orfs method checks whether self.orfs_called is True. If not, then it calls self.call_orfs. However, in the self.call_orfs method, it doesn't set self.orfs_called to True, so essentially we will have to re-run prodigal every time self.orfs is called.
In the get_kmers method of the Metagenome class, the last return line runs if normalize_kmers is set to False. I just think it would improve readability if that last line was in an else block. So instead of:

https://github.com/KwanLab/Autometa/blob/3e8c7718dc63e5a06654a899506535ae8bd6219e/autometa/common/metagenome.py#L411-L415

Have this:

if normalized_kmers:
    normalized_df = kmers.normalize(kmers_df)
    normalized_df.to_csv(normalized_df, sep="\t", header=True, index=True)
    return normalized_df
else:
    return kmers_df

Also, note in the quoted block above there is an error - the normalized variable is not defined. It appears this should be normalized_df.

The get_coverages method in the Metagenome class lacks a docstring.
The docstring of the assign_taxonomy method should be updated with the arguments currently accepted through *args and **kwargs. This method also relies on self.orfs_called (see above). Also, this method should probably assign self.taxonomy_assigned when it is done, but it doesn't. This is important because this attribute is checked by get_kingdoms, and if False that method will re-run assign_taxonomy.
In the get_kingdoms method, it would be useful to include what arguments are currently accepted in **kwargs in the docstring.
The write_ranks method seems to duplicate some functionality of the MAG class, and I'm not sure under what circumstances it would be needed?

evanroyrees commented 4 years ago

In the __init__ method of the Metagenome class, a path to a taxonomy table must be supplied. During the normal course of business, this will be generated by other class methods. However, if self.taxonomy_assigned is True, then we load an existing file into a pandas DataFrame and assign it to self.taxonomy. The trouble is, it is unclear to me how self.taxonomy_assigned would ever be set to True when a Metagenome object is created, so this mechanism needs to be refined.

taxonomy_assigned is a property. This means the value is updated as corresponding values are updated. Therefore, when self.taxonomy_fpath is written, self.taxonomy_assigned will then evaluate to True. We only want to read taxonomy_fpath if the file exists, otherwise we keep this value as None. Would self.taxonomy as a property resolve this?

evanroyrees commented 4 years ago

The Metagenome class has a number of methods with the @property decorator, which means they are like attributes which are calculated each time. The trouble is that a number of them rely on the self.sequences property. What this does is that it will parse through the assembly fasta file, and return a list of BioPython SeqIO objects. With a large assembly, this itself takes some time, and self.sequences is run for self.nseqs, self.mean_gc, self.size, self.largest_seq, self.fragmentation_metric (twice), self.describe (seven times!), and length_filter. I think either self.sequences should just load everything into memory, or perhaps if that is deemed to use too much memory, the method could populate all the dependent properties upon first running, and the dependent properties could instead be marked with @cached_property. This is a decorator defined in the functools module, which makes the method remember the return value after the first time. Connected with this issue, I am unclear on the difference between self.sequences and self.seqrecords, and the need for both methods.

@cached_property is new in version 3.8. We could update for python 3.8. The alternative for python=3.7 has been implemented.

evanroyrees commented 4 years ago

Connected with this issue, I am unclear on the difference between self.sequences and self.seqrecords, and the need for both methods.

self.sequences is a faster parser that does not create SeqRecord objects for each sequence. So in terms of IO, self.sequences is better at retrieving sequences without the need for any header information. self.seqrecords will return a list of SeqIO.SeqRecord objects that have their associated methods for handling these records. The former is more efficient with memory usage and IO, where the latter has more functionality for finer tuned handling of sequences.

evanroyrees commented 4 years ago

the Metagenome class, the orfs method checks whether self.orfs_called is True. If not, then it calls self.call_orfs. However, in the self.call_orfs method, it doesn't set self.orfs_called to True, so essentially we will have to re-run prodigal every time self.orfs is called

This does not occur. self.orfs_called will evaluate to True if the files are written. This is property specific behavior.

evanroyrees commented 4 years ago

In the get_kmers method of the Metagenome class, the last return line runs if normalize_kmers is set to False. I just think it would improve readability if that last line was in an else block. So instead of:

Have this:

if normalized_kmers:
    normalized_df = kmers.normalize(kmers_df)
    normalized_df.to_csv(normalized_df, sep="\t", header=True, index=True)
    return normalized_df
else:
    return kmers_df

Also, note in the quoted block above there is an error - the normalized variable is not defined. It appears this should be normalized_df.

normalized is one of the function arguments. This evaluates the boolean normalized_kmers and is the path to write the dataframe. The code-block you have provided is trying to write a dataframe to a dataframe.

if normalized_kmers:
    normalized_df = kmers.normalize(kmers_df)
    normalized_df.to_csv(normalized, sep="\t", header=True, index=True)
    return normalized_df
else:
    return kmers_df

https://github.com/KwanLab/Autometa/blob/3e8c7718dc63e5a06654a899506535ae8bd6219e/autometa/common/metagenome.py#L369-L379

https://github.com/KwanLab/Autometa/blob/3e8c7718dc63e5a06654a899506535ae8bd6219e/autometa/common/metagenome.py#L401

evanroyrees commented 4 years ago

This method also relies on self.orfs_called (see above). Also, this method should probably assign self.taxonomy_assigned when it is done, but it doesn't

self.orfs_called is a property that will evaluate to True once the ORFs are called. Similarly, self.taxonomy_assigned will evaluate to True once taxonomy is assigned.

evanroyrees commented 4 years ago

The write_ranks method seems to duplicate some functionality of the MAG class, and I'm not sure under what circumstances it would be needed?

This is here in case a user just wants to write out the fastas split at a canonical rank without performing binning. Instead of splitting at the superkingdom level (the default) you can also provide more specific ranks. There have been groups that have used the pipeline for taxonomic assignment and then performed there own binning, so this gives a bit more flexibility. We can remove, move to mag.py or keep it as is.

jason-c-kwan commented 4 years ago

I guess I missed the point of properties 😄

KwanLab / Autometa

[dev branch] Issues with autometa/common/metagenome.py #54