BIMSBbioinfo / janggu

Deep learning infrastructure for genomics
GNU General Public License v3.0
254 stars 33 forks source link

Weird one-hot encode from sequences stored as biopython SeqRecord #14

Closed PedroBarbosa closed 4 years ago

PedroBarbosa commented 4 years ago

Hi again,

Just a strange behaviour I just observed.

My sequences don't have any N, but still I'm getting [0 0 0 0] nucleotide features at some positions in the sequence (G and Ts).

print(_data[0])
Seq('AAAGGAAAATCTAAATTTGTCCAGAATCAGATGGGCAGATAGTGAAAAGGCGAG...GTG', <class 'Bio.Alphabet.IUPAC.IUPACUnambiguousDNA'>)

print(_data[0].seq)
AAAGGAAAATCTAAATTTGTCCAGAATCAGATGGGCAGATAGTGAAAAGGCGAGGATTTCAAACAAACTCGGTGTCTGTGACAGTCTTGCCAGAACCTGTG

_seqs = Bioseq.create_from_seq('seqs', fastafile=_data, order=1)

print(_seqs[0])

[[[[1 0 0 0]]

  [[1 0 0 0]]

  [[1 0 0 0]]

  [[0 0 0 0]]

  [[0 0 0 0]]

  [[1 0 0 0]]

  [[1 0 0 0]]

  [[1 0 0 0]]

  [[1 0 0 0]]

  [[0 0 0 0]]

  [[0 1 0 0]]

  [[0 0 0 0]]

  [[1 0 0 0]]

  [[1 0 0 0]]

  [[1 0 0 0]]

  [[0 0 0 0]]

  [[0 0 0 0]]

  [[0 0 0 0]]

  [[0 0 0 0]]

  [[0 0 0 0]]

  [[0 1 0 0]]

  [[0 1 0 0]]

  [[1 0 0 0]]

  [[0 0 0 0]]

  [[1 0 0 0]]

  [[1 0 0 0]]

  [[0 0 0 0]]

  [[0 1 0 0]]

  [[1 0 0 0]]

  [[0 0 0 0]]

  [[1 0 0 0]]

  [[0 0 0 0]]

  [[0 0 0 0]]

  [[0 0 0 0]]

  [[0 0 0 0]]

  [[0 1 0 0]]

  [[1 0 0 0]]

  [[0 0 0 0]]

  [[1 0 0 0]]

  [[0 0 0 0]]

  [[1 0 0 0]]

  [[0 0 0 0]]

  [[0 0 0 0]]

  [[0 0 0 0]]

  [[1 0 0 0]]

  [[1 0 0 0]]

  [[1 0 0 0]]

  [[1 0 0 0]]

  [[0 0 0 0]]

  [[0 0 0 0]]

  [[0 1 0 0]]

  [[0 0 0 0]]

  [[1 0 0 0]]

  [[0 0 0 0]]

  [[0 0 0 0]]

  [[1 0 0 0]]

  [[0 0 0 0]]

  [[0 0 0 0]]

  [[0 0 0 0]]

  [[0 1 0 0]]

  [[1 0 0 0]]

  [[1 0 0 0]]

  [[1 0 0 0]]

  [[0 1 0 0]]

  [[1 0 0 0]]

  [[1 0 0 0]]

  [[1 0 0 0]]

  [[0 1 0 0]]

  [[0 0 0 0]]

  [[0 1 0 0]]

  [[0 0 0 0]]

  [[0 0 0 0]]

  [[0 0 0 0]]

  [[0 0 0 0]]

  [[0 0 0 0]]

  [[0 1 0 0]]

  [[0 0 0 0]]

  [[0 0 0 0]]

  [[0 0 0 0]]

  [[0 0 0 0]]

  [[1 0 0 0]]

  [[0 1 0 0]]

  [[1 0 0 0]]

  [[0 0 0 0]]

  [[0 0 0 0]]

  [[0 1 0 0]]

  [[0 0 0 0]]

  [[0 0 0 0]]

  [[0 0 0 0]]

  [[0 1 0 0]]

  [[0 1 0 0]]

  [[1 0 0 0]]

  [[0 0 0 0]]

  [[1 0 0 0]]

  [[1 0 0 0]]

  [[0 1 0 0]]

  [[0 1 0 0]]

  [[0 0 0 0]]

  [[0 0 0 0]]

  [[0 0 0 0]]

  [[0 0 0 0]]]]

Sorry for throwing you with so many issues, maybe that's something I did wrong here. Best, Pedro

wkopp commented 4 years ago

Hi Pedro,

I've just tried to reproduce this issue, but I wasn't able to.

I've tried two things:

  1. I created a test.fa file with the sequence above
>testseq
AAAGGAAAATCTAAATTTGTCCAGAATCAGATGGGCAGATAGTGAAAAGGCGAGGATTTCAAACAAACTCGGTGTCTGTGACAGTCTTGCCAGAACCTGTG

which I've tried to load via

from janggu.data import Bioseq

data  = Bioseq.create_from_seq('dna', 'test.fa')

data[0]
array([[[[1, 0, 0, 0]],

        [[1, 0, 0, 0]],

        [[1, 0, 0, 0]],

        [[0, 0, 1, 0]],

        [[0, 0, 1, 0]],

        [[1, 0, 0, 0]],

        [[1, 0, 0, 0]],

        [[1, 0, 0, 0]],

        [[1, 0, 0, 0]],

        [[0, 0, 0, 1]],

        [[0, 1, 0, 0]],
...

This appears to work.

  1. I've tried to create the Bioseq object from a SeqRecord
from Bio.SeqRecord import SeqRecord
from janggu.utils import sequences_from_fasta

seqs = sequences_from_fasta('test.fa', 'dna')
seqs
[SeqRecord(seq=Seq('AAAGGAAAATCTAAATTTGTCCAGAATCAGATGGGCAGATAGTGAAAAGGCGAG...GTG', IUPACUnambiguousDNA()), id='testseq', name='testseq', description=' testseq', dbxrefs=[])]
dna2 = Bioseq.create_from_seq('dna', seqs)
dna2[0]
array([[[[1, 0, 0, 0]],

        [[1, 0, 0, 0]],

        [[1, 0, 0, 0]],

        [[0, 0, 1, 0]],

        [[0, 0, 1, 0]],

        [[1, 0, 0, 0]],

        [[1, 0, 0, 0]],

        [[1, 0, 0, 0]],

        [[1, 0, 0, 0]],

        [[0, 0, 0, 1]],

        [[0, 1, 0, 0]],

        [[0, 0, 0, 1]],
...

This also seems to work.

I was using Biopython==1.77 and the current janggu master-branch.

Could you check if this works for you?

Best, Wolfgang

PedroBarbosa commented 4 years ago

Hi Wolfgang,

Could you check if this works for you?

I did make both tests, and they work fine (I'm also using the same versions)

I figured what the problem was: I don't write fasta sequences to disk, so my sequences are all stored in dataframes. I had created a method that converts the sequences stored in a given column of the df to BioSeq records, and apparently I was giving the wrong alphabet IUPAC value:


list(seqs_df.apply(lambda x: SeqRecord(Seq(x[seq_col], iupac), id=x[header_col]), axis=1))

where iupac used to be IUPAC.IUPACUnambiguousDNA and now I changed to IUPAC.unambiguous_dna

This way it works. Thanks, Pedro