Closed PedroBarbosa closed 4 years ago
Hi Pedro,
I've just tried to reproduce this issue, but I wasn't able to.
I've tried two things:
test.fa
file with the sequence above>testseq
AAAGGAAAATCTAAATTTGTCCAGAATCAGATGGGCAGATAGTGAAAAGGCGAGGATTTCAAACAAACTCGGTGTCTGTGACAGTCTTGCCAGAACCTGTG
which I've tried to load via
from janggu.data import Bioseq
data = Bioseq.create_from_seq('dna', 'test.fa')
data[0]
array([[[[1, 0, 0, 0]],
[[1, 0, 0, 0]],
[[1, 0, 0, 0]],
[[0, 0, 1, 0]],
[[0, 0, 1, 0]],
[[1, 0, 0, 0]],
[[1, 0, 0, 0]],
[[1, 0, 0, 0]],
[[1, 0, 0, 0]],
[[0, 0, 0, 1]],
[[0, 1, 0, 0]],
...
This appears to work.
from Bio.SeqRecord import SeqRecord
from janggu.utils import sequences_from_fasta
seqs = sequences_from_fasta('test.fa', 'dna')
seqs
[SeqRecord(seq=Seq('AAAGGAAAATCTAAATTTGTCCAGAATCAGATGGGCAGATAGTGAAAAGGCGAG...GTG', IUPACUnambiguousDNA()), id='testseq', name='testseq', description=' testseq', dbxrefs=[])]
dna2 = Bioseq.create_from_seq('dna', seqs)
dna2[0]
array([[[[1, 0, 0, 0]],
[[1, 0, 0, 0]],
[[1, 0, 0, 0]],
[[0, 0, 1, 0]],
[[0, 0, 1, 0]],
[[1, 0, 0, 0]],
[[1, 0, 0, 0]],
[[1, 0, 0, 0]],
[[1, 0, 0, 0]],
[[0, 0, 0, 1]],
[[0, 1, 0, 0]],
[[0, 0, 0, 1]],
...
This also seems to work.
I was using Biopython==1.77 and the current janggu master-branch.
Could you check if this works for you?
Best, Wolfgang
Hi Wolfgang,
Could you check if this works for you?
I did make both tests, and they work fine (I'm also using the same versions)
I figured what the problem was: I don't write fasta sequences to disk, so my sequences are all stored in dataframes. I had created a method that converts the sequences stored in a given column of the df to BioSeq records, and apparently I was giving the wrong alphabet IUPAC value:
list(seqs_df.apply(lambda x: SeqRecord(Seq(x[seq_col], iupac), id=x[header_col]), axis=1))
where iupac
used to be IUPAC.IUPACUnambiguousDNA
and now I changed to IUPAC.unambiguous_dna
This way it works. Thanks, Pedro
Hi again,
Just a strange behaviour I just observed.
My sequences don't have any
N
, but still I'm getting [0 0 0 0] nucleotide features at some positions in the sequence (G and Ts).Sorry for throwing you with so many issues, maybe that's something I did wrong here. Best, Pedro