Proposal for Biopython 1.78 compatibility

emollier commented 4 years ago

Greetings,

Since future versions of Debian and Ubuntu are likely to be released with Biopython 1.78 or greater, I have been working on a patch to address issue #87, which you will find in this merge request.

I tried to make sure the test suite was passing as expected, and it led to a change I'm only half happy with; see the FIXME in seqmagick/subcommands/convert.py. Also, some reference data had to be updated due to a change in the Nexus file output in Biopython 1.78.

I have been following the recommendations from https://biopython.org/wiki/Alphabet, rather naively I believe, so take the commits with an adequate grain of salt, as always.

Have a nice day, :) Étienne.

jgallowa07 commented 4 years ago

Awesome! Thanks so much @emollier. I'll review this later today.

emollier commented 4 years ago

Hi Jared,

Jared Galloway, on 2020-11-10 15:42:37 -0800:

ALPHABETS = {
    'dna': Alphabet.generic_dna,
    'dna-ambiguous': IUPAC.ambiguous_dna,
    'protein': Alphabet.generic_protein,
    'rna': Alphabet.generic_rna,
    'rna-ambiguous': IUPAC.ambiguous_rna,
    'dna': 'DNA',

Doesn't look like this dict is necessary anymore, right?

Agreed. I hit a breakage yesterday when trying to remove it entirely, but on retrying today, my issue seemed unrelated. The ALPHABETS object is used only to restrict the choice of --alphabet values for the Nexus output. I suppose moving to a list, or removing the --alphabet choice restriction for Nexus files, might be adequate options.
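For illustration, the "moving to a list" option could look roughly like the sketch below. It is not the code from the merge request: the ALPHABET_CHOICES name, the build_parser helper and the help text are hypothetical, and the option names are simply the keys of the ALPHABETS dict quoted above.

```python
import argparse

# Hypothetical replacement for the ALPHABETS dict: a plain tuple that only
# restricts which values --alphabet accepts (names taken from the old keys).
ALPHABET_CHOICES = ("dna", "dna-ambiguous", "protein", "rna", "rna-ambiguous")


def build_parser():
    """Build a tiny parser exposing only the --alphabet option."""
    parser = argparse.ArgumentParser(prog="convert-sketch")
    parser.add_argument(
        "--alphabet",
        choices=ALPHABET_CHOICES,
        default=None,
        help="sequence alphabet, only needed for Nexus output (assumed wording)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--alphabet", "dna"])
    print(args.alphabet)
```

With this shape, argparse rejects any value outside the tuple, which is the only job the dict was doing at that point in the discussion.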

Kind Regards, -- Étienne Mollier etienne.mollier@mailoo.org Fingerprint: 8f91 b227 c7d6 f2b1 948c 8236 793c f67e 8f0d 11da Sent from /dev/pts/4, please excuse my verbosity.

emollier commented 4 years ago

Hi Jared,

Jared Galloway, on 2020-11-11 15:01:43 -0800:

Can you confirm (or help me understand) whether record.annotations["molecule_type"] should always be equal to "DNA"? I'm not sure I understand the context here :/

I did some further testing, and this annotation turned out to still be important for the Nexus file format. I am not sure how the annotation reached SeqIO.write initially, but it is no longer present, so it needed to be put back somehow.

From what I read in the Biopython 1.78 source code of the Nexus converter[0], "DNA" is the default datatype. From what I read on this wiki page[1], the Nexus format will accept the following datatypes:

DataType = { standard | DNA | RNA | nucleotide | protein | continuous }

[0] https://github.com/biopython/biopython/blob/biopython-178/Bio/Nexus/Nexus.py#L57
[1] http://wiki.christophchamp.com/index.php?title=NEXUS_file_format

To be safe, I restored the --alphabet option so that one of these datatypes can be selected, along with the ALPHABETS dictionary, now updated to map the existing options to the outputs handled by Biopython. Of all the Nexus datatypes, Biopython supports writing "DNA", "RNA" and "protein" (case insensitive).
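As a rough sketch of that mapping (not the actual patch): only the 'dna': 'DNA' entry appears in the quoted diff, so the other entries below are assumptions, as is the set_molecule_type helper. The records are stand-in objects with an annotations dict; in seqmagick they would be Biopython SeqRecord objects, which carry the same attribute, and would then be handed to SeqIO.write.

```python
from types import SimpleNamespace

# Assumed mapping from --alphabet values to the molecule types that Biopython
# can write to Nexus. Only 'dna' -> 'DNA' is taken from the quoted diff.
ALPHABETS = {
    "dna": "DNA",
    "dna-ambiguous": "DNA",
    "protein": "protein",
    "rna": "RNA",
    "rna-ambiguous": "RNA",
}


def set_molecule_type(records, alphabet="dna"):
    """Yield each record after setting annotations['molecule_type']."""
    molecule_type = ALPHABETS[alphabet]
    for record in records:
        record.annotations["molecule_type"] = molecule_type
        yield record


if __name__ == "__main__":
    # Stand-ins for SeqRecord objects: anything with an `annotations` dict.
    records = [
        SimpleNamespace(id="seq1", annotations={}),
        SimpleNamespace(id="seq2", annotations={}),
    ]
    for record in set_molecule_type(records, "rna-ambiguous"):
        print(record.id, record.annotations["molecule_type"])
```

Defaulting to "dna" mirrors the default datatype mentioned above; whether the real code should default or require an explicit choice is left open here.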

I also added some more tests for Nexus conversions, to see how things behave with each of these datatypes and to make sure the behavior is as expected. Please let me know if the test cases seem consistent to you.

Many thanks for having raised this issue! It could otherwise have remained under the radar.

Kind Regards, Étienne Mollier

jgallowa07 commented 4 years ago

Well, this all looks fantastic! Thank you so much @emollier! This was a huge help for me and all other seqmagick users!

fhcrc / seqmagick

Proposal for Biopython 1.78 compatibility #89