Different sequence lengths than when simply parsing fasta within python

blaiseli commented 7 years ago

I just tried the fasta parsing feature of pyGATB, and I find surprizing results. See the following ipython3 session:

In [1]: from gatb import Bank

In [2]: sum(len(seq.sequence) for seq in Bank("/Genomes/C_elegans/Caenorhabditis_elegans/Ensembl/WBcel235/Sequence/WholeGenomeFasta/genome.fa"))
Out[2]: 102932909

In [3]: sum((len(line.strip()) for line in open("/Genomes/C_elegans/Caenorhabditis_elegans/Ensembl/WBcel235/Sequence/WholeGenomeFasta/genome.fa") if line[0] != ">"))
Out[3]: 100286401

What could explain the differences ?

Edit: Same issue with the provided example:

In [1]: from gatb import Bank

In [2]: sum(len(seq.sequence) for seq in Bank("/home/bli/src/pyGATB/thirdparty/gatb-core/gatb-core/test/db/query.fa"))
Out[2]: 40428

In [3]: sum((len(line.strip()) for line in open("/home/bli/src/pyGATB/thirdparty/gatb-core/gatb-core/test/db/query.fa") if line[0] != ">"))
Out[3]: 39699

blaiseli commented 7 years ago

I did more investigation, comparing with Biopython, and it seems that the first sequence is skipped with pyGATB, and that the last one is duplicated:

Loading the sequences:

biopython_sequences = [str(rec.seq) for rec in SeqIO.parse("/home/bli/src/pyGATB/thirdparty/gatb-core/gatb-core/test/db/query.fa", "fasta")
gatb_sequences = [seq.sequence.decode("utf-8") for seq in Bank("/home/bli/src/pyGATB/thirdparty/gatb-core/gatb-core/test/db/query.fa")]

We have 71 sequences in both cases:

len(biopython_sequences)
len(gatb_sequences)

(This returns 71)

all(map(lambda pair : pair[0] == pair[1], zip(biopython_sequences[1:], gatb_sequences)))

This returns True, so all sequences are the same, but shifted by one.

What is the last gatb sequence then?

It is not the first biopython sequence (otherwise the sum of lengths in my previous message would have been the same):

last_gatb = gatb_sequences[-1]
biopython_sequences[0] == last_gatb

(This returns False)

Actually, the last gatb sequence is a duplication of the previous one:

gatb_sequences[-2] == last_gatb returns True

blaiseli commented 7 years ago

I'm not familiar with gatb and I'm not experienced in interfacing C/C++ code with python, but my intuition tells me that this piece of code could be responsible for the first sequence to be skipped:

https://github.com/GATB/pyGATB/blob/master/src/bank.pyi#L45

it.thisptr.first()

genscale-admin commented 7 years ago

Hi, thanks for providing us with this detailed issue. We'll look at that and come back to you as soon as possible.

genscale-admin commented 7 years ago

Hi,

thanks for providing us with this detailed issue on pyGATB. We'll look at that and come back to you as soon as possible.

Patrick

----- Mail original -----

De: "blaiseli" notifications@github.com À: "GATB/pyGATB" pyGATB@noreply.github.com Cc: "Subscribed" subscribed@noreply.github.com Envoyé: Mercredi 7 Juin 2017 11:41:40 Objet: [GATB/pyGATB] Different sequence lengths than when simply parsing fasta within python (#2)

I just tried the fasta parsing feature of pyGATB, and I find surprizing results. See the following ipython3 session: In [1]: from gatb import Bank

In [2]: sum(len(seq.sequence) for seq in Bank("/Genomes/C_elegans/Caenorhabditis_elegans/Ensembl/WBcel235/Sequence/WholeGenomeFasta/genome.fa")) Out[2]: 102932909

In [3]: sum((len(line.strip()) for line in open("/Genomes/C_elegans/Caenorhabditis_elegans/Ensembl/WBcel235/Sequence/WholeGenomeFasta/genome.fa") if line[0] != ">")) Out[3]: 100286401

What could explain the differences ?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub , or mute the thread .

Piezoid commented 7 years ago

Thanks for the report. This should be fixed now.

I've just pushed 0.1.2 for linux on PyPI, the macOS package is available on the wiki.

blaiseli commented 7 years ago

Thanks for your quick reaction. I haven't tested yet.

I have a side question. I can't seem to install from PyPI using pip:

$ pip3.6 install pyGATB --upgrade
Requirement already up-to-date: pyGATB in /home/bli/.local/lib/python3.6/site-packages/pyGATB-0.1.1-py3.6-linux-x86_64.egg

Any clue?

(I had installed 0.1.1 git cloning and using cmake, make and setup.py)

blaiseli commented 7 years ago

I updated to 0.1.2 through the github way, and it seems to work, thanks: https://bioinformatics.stackexchange.com/a/380/292

Piezoid commented 7 years ago

Are you on linux ? The package for macOS is not on PyPI yet.

On my machine the upgrade is working:

Collecting pyGATB
  Downloading pyGATB-0.1.2-cp36-cp36m-manylinux1_x86_64.whl (3.9MB)
    100% |████████████████████████████████| 3.9MB 346kB/s 
Installing collected packages: pyGATB
  Found existing installation: pyGATB 0.1.1
    Uninstalling pyGATB-0.1.1:
      Successfully uninstalled pyGATB-0.1.1
Successfully installed pyGATB-0.1.2

PS: I'm glad to learn that pyGATB is almost the fastest in your benchmarks. We haven't done any benchmarking on the wrapper yet. Also, the Bank interface is barely tested.

I've added a free list for faster allocations of Sequences, maybe it could help.

blaiseli commented 7 years ago

Yes, I'm on linux (Xubuntu 16.04), but I have a manually compiled python3.6 installation. I don't think this is relevant, since I can usually install packages using pip3.6. Maybe the problem is that I had installed pyGATB "manually" in the first place. Or pip maintains a cache somewhere, which hasn't been updated.

I just tried the latest github version where you added this "free list" thing, and I don't notice a notable difference in speed. I'm parsing a well assembled genome with few chromosomes in my test, so the file has long sequences, but just a few. This is probably not the typical use case. I guess that more relevant benchmarks would use high-throughput sequencing data in fastq.gz format.

Piezoid commented 7 years ago

Yes you're right, I wrongly assumed that you were using sequencing data.

One minor observation: you can avoid to decode the sequence

print(sum((seq.sequence.count(b"A") for seq in bank)))

For the same reason, command line text tools like grep perform better with LANG=C than with an utf-8 locale.

blaiseli commented 7 years ago

Thanks for the tip, this seems to bring a slight improvement. I updated my benchmark on bioinformatics SE.

blaiseli commented 7 years ago

I see that someone mentioned the stackexchange benchmarks here: http://gatb.inria.fr/pygatb-speed-test/.

pyGATB is indeed fast compared to the other python implementations I tested, but there are other non-python solutions in the other answers that are likely quite faster (in particular the currently accepted one: https://bioinformatics.stackexchange.com/a/369/292), so overall, it is not accurate to call pyGATB the "second fastest tool".

By the way, I added fastq.gz tests, and pyGATB is the best of the three python implementations I tried (but not by much compared with the readfq pure python module).

What I would be curious to see tested is the native C++ version of GATB, against the C version of readfq and against SeqAn. Unfortunately, I'm not a C/C++ programmer. If you or some of your colleagues could try that and post an answer to the stackexchange question, it would be great.

GATB / pyGATB

Different sequence lengths than when simply parsing fasta within python #2