benedictpaten / marginAlign

UCSC Nanopore
MIT License
43 stars 13 forks source link

PySam error with MarginAlign+Minimap2 (last realease) #46

Open tetedange13 opened 5 years ago

tetedange13 commented 5 years ago

Hello,

I am trying to use MarginAlign with the last release of Minimap2 (2.15-r908-dirty) and I had several issues:

1) I had this mysterious PySam error, that has not yet been mentionned in any issues (I think). I did the following sequence of commands to install MarginAlign and get the last release of Minimap2:

git clone https://github.com/benedictpaten/marginAlign.git
cd marginAlign
git pull
git submodule init
git submodule update
make
virtualenv --python=/usr/bin/python2 --no-site-packages --distribute env && source env/bin/activate && pip install -r requirements.txt
make test

Everything went fine, both on the make test and using a toy dataset (20 reads against a small database). But when I used a bigger fastq file (against the same database) an error appeared, apparently coming from the getLastNonClippedPositionInRead function (utils.py file). The alignedSegment.query_alignment_end was crashing and returning a "SystemError: error return without exception set" (I checked the 2 other variables, they seemed fine). I found a PySam issue mentionning a similar thing, but in a different context (https://github.com/pysam-developers/pysam/issues/176), so I fixed this by installing pysam==0.9.0 instead of the version mentionned in the requirement.txt I don't know, maybe I did something wrong, or maybe it is linked with the use of the last version of Minimap2 ?

2) Moreover, trying to use the --em option, to run the EM learning followed by the realignement, I realized that MarginAlign was always loading the HMM file "last_hmm_20.txt". So I had to comment the line options.inputModel = os.path.join(pathToBaseNanoporeDir(), "src", "margin", "mappers", "last_hmm_20.txt") in the marginAlign.py file. I guess this is linked with the fact that LAST is set as the default aligner, but maybe there is something to fix here?

Thanks in advance for your answer and many thanks for implementing Minimap2 into MarginAlign! Felix.

mitenjain commented 5 years ago

Hi Felix,

Couple of thoughts.

  1. Was the virtualenv functional before you had to update pysam?
  2. When you say a large size file, how large are you talking about?
  3. For the hmm file, you should consider training your own model since the stock one is out of data (and needs updating with a newer one). I will try to find a newer R9.4 human trained em model to upload. The hmm file came from last as a legacy but works well with other aligners.
  4. Thank you about minimap2. That is also the least tested aligner for realignment and chaining so some of the error could be due to that. I'll look into this.

If you could share a toy dataset as an MWE for me to reproduce this here, that would be very helpful.

Thank you. Best regards, -Miten

tetedange13 commented 5 years ago

Hi Miten,

Thanks for your very quick answer.

  1. I only had 1 issue creating the virtualenv (with distribute and markerLib https://github.com/pypa/setuptools/issues/535) but this package did not seem to be essential as:

    • The make test worked ;
    • The alignment with Minimap2 (last release) + realignement of my toy dataset against the database (see the part about MWE bellow for details), using the default last_hmm_20.txt file worked ;
    • The training of a new HMM worked until the complete 100 iterations (even if I had this issue mentionned in point 2.).
  2. Not that large actually, so that I was surprised to get that PySam error... As they are reads from experimental Nanopore 1D R9 runs, I don't think that I am allowed to share them. Anyway, I can tell you that the fastq file (not .gz) was about 300 MB large, containing 244 458 reads. You can probably find some others available Nanopore runs, much bigger than mine (you can maybe give a try with this one from this publication where file1 = 730 MB, I did not try it myself), that will give you (I hope) the same error.

  3. Yes that is precisely what I was trying to do: training a new HMM model by running MarginAlign on my own fastq file, with the --em --outputModel output.hmm option! But at the end, the program displayed something like "model used to realign sam file = last_hmm_20.txt", instead of using my generated output.hmm file.

Concerning the MWE you asked for:

Thanks again. Best regards. Felix.