Teichlab / tracer

TraCeR - reconstruction of T cell receptor sequences from single-cell RNAseq data

Newer python versions and Bio Alphabet #112

Open NathanSiemers opened 3 years ago

NathanSiemers commented 3 years ago

Hello, I'm trying to get tracer running on a more modern version of Python (3.8.10). Since tracer was written, Bio.Alphabet has been removed from Biopython, and the recommendation is that references to it (IUPAC) can be removed from most code without a problem.

Is it feasible to do this? Are there any known successes or issues with later versions of Python?

Thank you.

  File "/usr/local/lib/python3.8/site-packages/tracer-0.5-py3.8.egg/tracerlib/tracer_func.py", line 29, in <module>
    from Bio.Alphabet import IUPAC
  File "/usr/local/lib/python3.8/site-packages/Bio/Alphabet/__init__.py", line 20, in <module>
    raise ImportError(
ImportError: Bio.Alphabet has been removed from Biopython. In many cases, the alphabet can simply be ignored and removed from scripts. In a few cases, you may need to specify the molecule_type ...

NathanSiemers commented 3 years ago

I tested removing the import calls in __init__.py and one other file, and tracer loaded correctly, but I haven't done a test run yet.

mstubb commented 3 years ago

Hi Nathan,

Thanks for this! I'll be happy to accept a PR that updates this if you'd like to submit one.

All the best,

Mike

NathanSiemers commented 3 years ago

I've spent several days working on a pull request. I removed the Bio.Alphabet dependencies and changed how the Seq objects are created so that they no longer depend on Bio.Alphabet.IUPAC. I have also been editing the Dockerfile to bring all packages up to modern versions and to run the tests.

I can send you what I have so far, but there's an error in 'tracer test'. There still seems to be an obscure call to Bio.Alphabet in the pickle dump/load that I find difficult to trace. Probably in part because I'm not a Python hacker, I can't resolve this one. Some help from the group would be appreciated.

(fragment of tracer test below, I can't find a remaining reference to Bio Alphabet anywhere in the code base.)

Running Kallisto

Making Kallisto indices

[build] loading fasta file /tracer/test_data/results/cell1/expression_quantification/kallisto_index/cell1_transcriptome.fa
[build] k-mer length: 31
[build] warning: clipped off poly-A tail (longer than 10) from 654 target sequences
[build] warning: replaced 3 non-ACGUT characters in the input sequence with pseudorandom nucleotides
[build] counting k-mers ... done.
[build] building target de Bruijn graph ... done
[build] creating equivalence classes ... done
[build] target de Bruijn graph has 781463 contigs and contains 113560426 k-mers

Quantifying with Kallisto

[quant] fragment length distribution will be estimated from the data
[index] k-mer length: 31
[index] number of targets: 131,104
[index] number of k-mers: 113,560,426
[index] number of equivalence classes: 460,618
[quant] running in paired-end mode
[quant] will process pair 1: /tracer/test_data/cell1_1.fastq
                             /tracer/test_data/cell1_2.fastq
[quant] finding pseudoalignments for the reads ... done
[quant] processed 1,135 reads, 1,042 reads pseudoaligned
[quant] estimated average fragment length: 106.333
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 52 rounds

Filtering by read count

Traceback (most recent call last):
  File "/usr/local/bin/tracer", line 11, in <module>
    load_entry_point('tracer==0.5', 'console_scripts', 'tracer')()
  File "/usr/local/lib/python3.7/dist-packages/tracer-0.5-py3.7.egg/tracerlib/launcher.py", line 43, in launch
    Task().run()
  File "/usr/local/lib/python3.7/dist-packages/tracer-0.5-py3.7.egg/tracerlib/tasks.py", line 1230, in run
    loci=['A', 'B'], species='Mmus').run()
  File "/usr/local/lib/python3.7/dist-packages/tracer-0.5-py3.7.egg/tracerlib/tasks.py", line 766, in run
    cl = pickle.load(pkl)
  File "/usr/local/lib/python3.7/dist-packages/Bio/Alphabet/__init__.py", line 21, in <module>
    "Bio.Alphabet has been removed from Biopython. In many cases, the alphabet can simply be ignored and removed from scripts. In a few cases, you may need to specify the molecule_type as an annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet for more information."
ImportError: Bio.Alphabet has been removed from Biopython. In many cases, the alphabet can simply be ignored and removed from scripts. In a few cases, you may need to specify the molecule_type as an annotation on a SeqRecord for your script to work correctly. Please see https://biopython.org/wiki/Alphabet for more information.

NathanSiemers commented 3 years ago

I think the error is hard to trace because the Bio.Alphabet references are embedded in the pickled (.pkl) reference files in the test data, in directories like this:

https://github.com/Teichlab/tracer/tree/master/test_data/results/cell2/unfiltered_TCR_seqs

If that's true, then the error is due to a modern Python environment (with a current Biopython) not being able to load the old reference test results that were pickled.

N

(some text strings from the pkl file below)

S'alphabet'p154g0(cBio.AlphabetHasStopCodonp155g2Ntp156Rp157(dp158S'stop_symbol'p159S'*'p160sg154g0(cBio.Alphabet.IUPACExtendedIUPACProteinp161g2Ntp162Rp163sS'letters'
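One way to check this without unpickling anything is to scan the pickle opcodes for the classes they reference. The sketch below uses only the standard library's pickletools module; the function name referenced_globals and the small demo payload are made up for illustration and are not part of tracer. The handling of STACK_GLOBAL (protocol 4 and later) is a heuristic that assumes the module and class names are the two most recent strings in the stream.

```python
import pickletools


def referenced_globals(data: bytes) -> set:
    """Return (module, name) pairs that a pickle would import on load.

    The pickle is only scanned, never executed, so this works even when
    the referenced modules can no longer be imported.
    """
    found = set()
    recent_strings = []  # string arguments seen so far, used for STACK_GLOBAL
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name == "GLOBAL":
            # Protocols 0-3: pickletools reports the argument as "module name".
            module, name = arg.split(" ", 1)
            found.add((module, name))
        elif opcode.name == "STACK_GLOBAL" and len(recent_strings) >= 2:
            # Protocol 4+: heuristic, assumes module and name were just pushed.
            found.add((recent_strings[-2], recent_strings[-1]))
        elif isinstance(arg, str):
            recent_strings.append(arg)
    return found


if __name__ == "__main__":
    # Hand-made protocol 0 fragment that mimics the strings seen in the .pkl file.
    demo = b"cBio.Alphabet\nHasStopCodon\np0\n."
    print(sorted(referenced_globals(demo)))
```

To use it on one of the test files, read the file in binary mode and pass the bytes to referenced_globals; any entry whose module starts with Bio.Alphabet would confirm the diagnosis.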

mstubb commented 3 years ago

Thanks Nathan.

Yes, I think you're right that the error comes from the test trying to load the old pickled files, which were created with a previous version of BioPython.

I think that a solution here would be to use an environment with the old BioPython to load those pickled files and then write them out as some kind of parseable text file (not as a pickle).
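If setting up an environment with the old BioPython turns out to be awkward, a possible alternative (not something tested against tracer's files) is to load the old pickles with an unpickler that substitutes a placeholder for every class under Bio.Alphabet. The sketch below relies only on the standard pickle module; StubAlphabet, AlphabetFreeUnpickler, load_without_alphabet and DEMO_PAYLOAD are hypothetical names, and the payload is a hand-written protocol 0 fragment rather than real test data. Whether the other pickled objects (for example Seq) restore cleanly under a new BioPython is an open question.

```python
import io
import pickle


class StubAlphabet:
    """Placeholder for any class that used to live under Bio.Alphabet."""


class AlphabetFreeUnpickler(pickle.Unpickler):
    """Unpickler that never imports Bio.Alphabet."""

    def find_class(self, module, name):
        if module == "Bio.Alphabet" or module.startswith("Bio.Alphabet."):
            return StubAlphabet
        # Everything else is resolved the normal way.
        return super().find_class(module, name)


def load_without_alphabet(data: bytes, encoding: str = "latin1"):
    """Unpickle bytes, replacing Bio.Alphabet classes with StubAlphabet.

    encoding="latin1" is a common choice when reading pickles that were
    written by Python 2; it is an assumption here, not a known requirement.
    """
    return AlphabetFreeUnpickler(io.BytesIO(data), encoding=encoding).load()


# Hand-made protocol 0 pickle of an "ExtendedIUPACProtein" instance whose
# only attribute is letters='ACDE' (illustrative, not taken from tracer).
DEMO_PAYLOAD = (
    b"ccopy_reg\n_reconstructor\np0\n"
    b"(cBio.Alphabet.IUPAC\nExtendedIUPACProtein\np1\n"
    b"c__builtin__\nobject\np2\nNtp3\nRp4\n"
    b"(dp5\nS'letters'\np6\nS'ACDE'\np7\nsb."
)

if __name__ == "__main__":
    obj = load_without_alphabet(DEMO_PAYLOAD)
    print(type(obj).__name__, vars(obj))
```

Objects loaded this way still carry an alphabet attribute pointing at a stub, so they would need to be cleaned up before being written out again.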

The pickles are representations of a Cell (https://github.com/Teichlab/tracer/blob/84f53e5ae0211822580be53841fc097fa8694419/tracerlib/core.py#L10) object and its Recombinant (https://github.com/Teichlab/tracer/blob/84f53e5ae0211822580be53841fc097fa8694419/tracerlib/core.py#L298) objects.

These classes aren't very complex so you could write out a text file containing their instance variables.

You could then switch to an environment with the new version of BioPython, recreate the objects using the values in your text file and then repickle them. Those pickles should then be compatible and the test should pass.

Cheers,

Mike
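The export-and-rebuild steps described above could be sketched roughly as follows, using JSON as the text format. DemoRecombinant and its attributes are hypothetical stand-ins (the real Cell and Recombinant classes are in tracerlib/core.py and have different fields), and the helper names to_plain, export_instance_vars and rebuild are invented for this example.

```python
import json


class DemoRecombinant:
    """Hypothetical stand-in; attribute names are illustrative only."""

    def __init__(self, contig_name, locus, productive, dna_seq):
        self.contig_name = contig_name
        self.locus = locus
        self.productive = productive
        self.dna_seq = dna_seq


def to_plain(value):
    """Reduce a value to JSON-friendly types; unknown objects become str()."""
    if value is None or isinstance(value, (bool, int, float, str)):
        return value
    if isinstance(value, (list, tuple, set)):
        return [to_plain(v) for v in value]
    if isinstance(value, dict):
        return {str(k): to_plain(v) for k, v in value.items()}
    # For example, a sequence object is stored as its plain string form.
    return str(value)


def export_instance_vars(obj) -> str:
    """Step 1 (old environment): write an object's instance variables as text."""
    plain = {key: to_plain(val) for key, val in vars(obj).items()}
    return json.dumps(plain, indent=2, sort_keys=True)


def rebuild(cls, text: str):
    """Step 2 (new environment): recreate an object from the exported text."""
    obj = cls.__new__(cls)  # skip __init__; attributes are restored directly
    obj.__dict__.update(json.loads(text))
    return obj


if __name__ == "__main__":
    original = DemoRecombinant("contig_1", "A", True, "ACGT")
    text = export_instance_vars(original)
    restored = rebuild(DemoRecombinant, text)
    print(vars(restored) == vars(original))
```

After rebuilding in the new environment, the objects can be written back with pickle.dump. Because a Cell holds Recombinant objects, nested objects would need to be exported and rebuilt one class at a time, and any attribute that should be a Seq again would have to be converted back from its string form.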