davideyre / runListCompare

Several errors related to deprecated Biopython modules, etc. #2

Closed aedecano closed 2 years ago

aedecano commented 2 years ago

The first issue was that I had to create the multi-level (nested) output directories by hand on the command line, because running the code does not generate them automatically. The remaining issues are caused by deprecated Biopython modules, which produce the error messages below.
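
As a minimal sketch of how the missing nested output directories could be created from Python rather than by hand, the standard library's `os.makedirs` with `exist_ok=True` builds every missing level in one call. The helper name `ensure_output_dirs` and the subfolder names in the usage note are illustrative assumptions, not code from runListCompare.

```python
import os


def ensure_output_dirs(base_dir, subdirs=()):
    """Create base_dir and any listed subdirectories, including missing parents.

    Hypothetical helper for illustration; not part of runListCompare.
    """
    created = []
    for path in [base_dir] + [os.path.join(base_dir, s) for s in subdirs]:
        # exist_ok=True makes the call a no-op when the directory already exists
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created
```

For example, `ensure_output_dirs("tests/output/ec", ["cluster", "cluster_ml"])` would create the whole tree, and calling it again would be harmless.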

python3 runListCompare.py tests/data/ec/ec.ini

Checking percentage ACGT in samples
Checking percentage ACGT in: tests/data/ec/75dadb6d-7859-4012-bbf5-3f907fbe24c9.fa.gz
Checking percentage ACGT in: tests/data/ec/76cd0000-bf91-4920-8e86-60f840a0d162.fa.gz
Checking percentage ACGT in: tests/data/ec/02080367-872a-47fa-9412-a3eefb7cfb86.fa.gz

Generate all vs all alignment using 1 cores
Proceeding without mask file
python mtAlign.py -p 1 tests/output/ec/clean_seqlist.txt tests/data/ec/R00000042.fasta tests/output/ec/align
Traceback (most recent call last):
  File "mtAlign.py", line 11, in <module>
    from Bio.Alphabet import generic_dna
  File "/home/ubuntu/anaconda3/envs/runlistcompare/lib/python3.7/site-packages/Bio/Alphabet/__init__.py", line 21, in <module>
    "Bio.Alphabet has been removed from Biopython. In many cases, the alphabet can simply be ignored and removed from scripts. In a few cases, you may need to specify the ``molecule_type`` as an annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet for more information."
ImportError: Bio.Alphabet has been removed from Biopython. In many cases, the alphabet can simply be ignored and removed from scripts. In a few cases, you may need to specify the ``molecule_type`` as an annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet for more information.
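
The ImportError above says the alphabet can usually just be removed from scripts. Deleting the import alone is not enough if `generic_dna` is still passed as an argument elsewhere, so it may help to list every line that mentions it first. The sketch below uses only the standard library; the function name `find_alphabet_usages` and the sample source text are hypothetical, and the pattern covers only the two names that appear in the traceback.

```python
import re

# Names taken from the traceback above; extend the pattern if the scripts use others.
ALPHABET_PATTERN = re.compile(r"\bBio\.Alphabet\b|\bgeneric_dna\b")


def find_alphabet_usages(source_text):
    """Return (line_number, line) pairs that mention Bio.Alphabet or generic_dna."""
    hits = []
    for number, line in enumerate(source_text.splitlines(), start=1):
        if ALPHABET_PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits


# Hypothetical script contents, for illustration only
example = """from Bio import SeqIO
from Bio.Alphabet import generic_dna
seq = Seq(bases, generic_dna)
"""
for number, line in find_alphabet_usages(example):
    print(number, line)
```

Running the same function over the text of each script in the repository would give a checklist of lines to edit.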

I removed the Bio.Alphabet imports from all the scripts. The first task, creating an all-vs-all alignment, then completed, but the run hit another error:

Traceback (most recent call last):
  File "mtAlign.py", line 157, in <module>
    SeqIO.write( seqlist , '%s_snps.fa'%outname_prefix, 'fasta' )
  File "/home/ubuntu/anaconda3/envs/runlistcompare/lib/python3.7/site-packages/Bio/SeqIO/__init__.py", line 518, in write
    fp.write(format_function(record))
  File "/home/ubuntu/anaconda3/envs/runlistcompare/lib/python3.7/site-packages/Bio/SeqIO/FastaIO.py", line 389, in as_fasta
    data = _get_seq_string(record)  # Catches sequence being None
  File "/home/ubuntu/anaconda3/envs/runlistcompare/lib/python3.7/site-packages/Bio/SeqIO/Interfaces.py", line 110, in _get_seq_string
    return str(record.seq)
  File "/home/ubuntu/anaconda3/envs/runlistcompare/lib/python3.7/site-packages/Bio/Seq.py", line 326, in __str__
    return self._data.decode("ASCII")
AttributeError: 'str' object has no attribute 'decode'

I resolved this by editing ~/anaconda3/envs/runlistcompare/lib/python3.7/site-packages/Bio/Seq.py and commenting out all the ".decode()" calls.
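
The AttributeError in that traceback is standard Python 3 behaviour: `bytes` has a `.decode()` method and `str` does not, so the message indicates that `Seq.__str__` found a `str` where it expected `bytes`. The sketch below uses only built-in types and a hypothetical helper, `to_text`, to show the difference and one way to normalise either type in calling code rather than in the installed library. Where the `str` actually originates in mtAlign.py is not established here.

```python
def to_text(data):
    """Return sequence data as str whether it arrives as bytes or str.

    Hypothetical helper for illustration; not part of Biopython or runListCompare.
    """
    if isinstance(data, (bytes, bytearray)):
        return data.decode("ASCII")
    return str(data)


# bytes can be decoded; str has no decode method, which reproduces the error above
assert to_text(b"ACGT") == "ACGT"
assert to_text("ACGT") == "ACGT"
assert not hasattr("ACGT", "decode")
```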

However, I hit a wall when the following error came up while generating clusters and clean alignments:

Generate clusters and clean alignments
python clusterCreator.py -s 10000  tests/output/ec/initial_nodes.txt tests/output/ec/align-compare.txt tests/output/ec/clusters.txt
Removing excluded nodes
python getClusterAlign.py -p 1 -s 0.7 -v 0.7 -n 0 tests/output/ec/clean_seqlist.txt tests/output/ec/clusters.txt tests/data/ec/R00000042.fasta tests/output/ec
Proceeding without mask file
python getAlignment.py tests/output/ec/cluster/cluster_1.txt tests/data/ec/R00000042.fasta tests/output/ec/cluster/cluster_1
Successfully read in reference.
tests/data/ec/76cd0000-bf91-4920-8e86-60f840a0d162.fa.gz
tests/data/ec/75dadb6d-7859-4012-bbf5-3f907fbe24c9.fa.gz
tests/data/ec/02080367-872a-47fa-9412-a3eefb7cfb86.fa.gz
Successfully read in sequences in 0.25903166690841317 seconds.
Successfully obtained masked nonshared_diffs; there are 4687 of them.
Successfully wrote snps fasta file.
Successfully wrote nonshared positions.
Successfully completed alignment in 9.268191209994256 seconds.
Cleaning tests/output/ec/cluster/cluster_1
Traceback (most recent call last):
  File "cleanAlignment.py", line 103, in <module>
    called = len([b for b in s.seq if b in bases])
  File "cleanAlignment.py", line 103, in <listcomp>
    called = len([b for b in s.seq if b in bases])
  File "/home/ubuntu/anaconda3/envs/runlistcompare/lib/python3.7/site-packages/Bio/Seq.py", line 430, in __getitem__
    return chr(self._data[index])
TypeError: an integer is required (got type str)

I tried changing chr() to str() on line 430 of ~/anaconda3/envs/runlistcompare/lib/python3.7/site-packages/Bio/Seq.py, but that produced either empty files (e.g. align_snps.fa) or empty cluster/ and cluster_ml/ folders.

The scripts seem to need further modification, but at this point I could not identify the specific lines to change.
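
The TypeError in the last traceback comes from a basic Python 3 distinction: indexing `bytes` yields an `int`, which `chr()` converts to a character, whereas indexing `str` already yields a one-character `str`, which `chr()` rejects. The sketch below uses only built-in types. It also shows that `str()` is not a drop-in replacement for `chr()` on byte data, which is one way such a patch can silently change results. The `bases` set is an assumed stand-in for the one in cleanAlignment.py, and whether this is what emptied the output files here is not confirmed by the logs.

```python
raw_bytes = b"ACGT"
raw_text = "ACGT"

# Indexing bytes gives an integer code point; chr() turns it back into a base
assert raw_bytes[0] == 65
assert chr(raw_bytes[0]) == "A"

# Indexing str already gives a character, so chr() raises TypeError
try:
    chr(raw_text[0])
    chr_failed = False
except TypeError:
    chr_failed = True
assert chr_failed

# str() on an indexed byte yields digits, not the base, so membership tests fail
bases = {"A", "C", "G", "T"}  # assumed contents, for illustration
assert str(raw_bytes[0]) == "65"
called = len([b for b in map(str, raw_bytes) if b in bases])
assert called == 0
```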

bede commented 2 years ago

Thanks Arun, these issues are due to deprecations in recent Biopython releases. The tests now pass with the revised install instructions in https://github.com/davideyre/runListCompare/commit/f950d9140636148430c5b4b959b87d4c8a8f2bca
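
As a hedged illustration of why the installed version matters: Bio.Alphabet was removed in Biopython 1.78, so scripts that import it need an earlier release. The sketch below checks a version string against that threshold using only the standard library. The function names are hypothetical, and the exact version required by the linked commit is not stated in this thread.

```python
import re


def parse_version(version_string):
    """Turn a version such as '1.79' into a tuple of ints, ignoring any suffix."""
    return tuple(int(part) for part in re.findall(r"\d+", version_string)[:2])


def has_bio_alphabet(version_string):
    """Bio.Alphabet was removed in Biopython 1.78, so earlier releases still ship it."""
    return parse_version(version_string) < (1, 78)
```

With Biopython installed, `has_bio_alphabet(Bio.__version__)` (after `import Bio`) would report whether the environment can still run scripts that import `Bio.Alphabet`.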