bcgsc / NanoSim

Nanopore sequence read simulator

a bug in the pre-trained albacore model #110

Closed huzuner closed 3 years ago

huzuner commented 3 years ago

Hello,

I am using NanoSim v3.0.0 and would like to report a bug triggered by "human_NA12878_DNA_FAB49712_albacore.tar.gz", which is found in the pre-trained_models directory. When I simulate human reads with this model, the aligned FASTQ files come out malformed. When I run a FASTQ validator on one aligned_reads.fastq with:

biopet-validatefastq -i results/nanosim/hum/1_aligned_reads.fastq

I get the following error, and no further processing is possible with this FASTQ file. For example, Sourmash throws the same error as the FASTQ validator whenever I try to compute signatures.

INFO  [2021-03-17 10:37:50,890] [ValidateFastq$] - Start
Exception in thread "main" htsjdk.samtools.SAMException: Sequence and quality line must be the same length at line 141125 in fastq /vol/compute/hamdiyes_project/Simulation/results/nanosim/hum/1_aligned_reads.fastq
    at htsjdk.samtools.fastq.FastqReader.readNextRecord(FastqReader.java:130)
    at htsjdk.samtools.fastq.FastqReader.next(FastqReader.java:152)
    at htsjdk.samtools.fastq.FastqReader.next(FastqReader.java:43)
    at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
    at nl.biopet.tools.validatefastq.ValidateFastq$.main(ValidateFastq.scala:52)
    at nl.biopet.tools.validatefastq.ValidateFastq.main(ValidateFastq.scala)

On the other hand, when I use "human_NA12878_DNA_FAB49712_guppy.tar.gz", fastq validator and sourmash do not throw any errors and I have no problem.

The albacore model may need to be re-trained.
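To narrow down which reads are affected, a minimal sketch along the following lines could list every record whose sequence and quality strings differ in length. It assumes a strict four-line FASTQ layout (header, sequence, `+`, quality); the function name `find_length_mismatches` is made up for illustration and is not part of NanoSim or biopet. The reported line number is that of the quality line, which may not match the line number printed by the validator.

```python
from itertools import islice


def find_length_mismatches(path):
    """Yield (quality_line_number, header, seq_len, qual_len) for bad records.

    Assumes four lines per record; a truncated final record is padded with
    empty strings so that it is reported as a mismatch as well.
    """
    with open(path) as handle:
        record_index = 0
        while True:
            lines = [line.rstrip("\r\n") for line in islice(handle, 4)]
            if not lines:
                break  # end of file
            record_index += 1
            lines += [""] * (4 - len(lines))  # pad a truncated record
            header, seq, _separator, qual = lines
            if len(seq) != len(qual):
                yield (4 * record_index, header, len(seq), len(qual))
```

Running it on the file from the report, for example `for hit in find_length_mismatches("results/nanosim/hum/1_aligned_reads.fastq"): print(hit)`, would show the read names involved, which might help with reproducing the problem.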

In addition, the conda installation of NanoSim is also problematic. When I install it via the bioconda channel, simulator.py throws an error at the step "Read KDF of unaligned reads" shortly after the script starts. My command is:

simulator.py genome -rg results/refs/hs_genome.fasta -c resources/human_NA12878_DNA_FAB49712_albacore/training -b albacore --num_threads 2 --fastq -o test.fastq -n 10000

And the error:

Traceback (most recent call last):
  File "/home/uzuner/miniconda3/bin/simulator.py", line 1702, in <module>
    main()
  File "/home/uzuner/miniconda3/bin/simulator.py", line 1599, in main
    read_profile(ref_g, None, number, model_prefix, perfect, args.mode, strandness, None, False, dna_type, None)
  File "/home/uzuner/miniconda3/bin/simulator.py", line 411, in read_profile
    kde_unaligned = joblib.load(model_prefix + "_unaligned_length.pkl")
  File "/home/uzuner/miniconda3/lib/python3.8/site-packages/joblib/numpy_pickle.py", line 585, in load
    obj = _unpickle(fobj, filename, mmap_mode)
  File "/home/uzuner/miniconda3/lib/python3.8/site-packages/joblib/numpy_pickle.py", line 504, in _unpickle
    obj = unpickler.load()
  File "/home/uzuner/miniconda3/lib/python3.8/pickle.py", line 1210, in load
    dispatch[key[0]](self)
  File "/home/uzuner/miniconda3/lib/python3.8/pickle.py", line 1526, in load_global
    klass = self.find_class(module, name)
  File "/home/uzuner/miniconda3/lib/python3.8/pickle.py", line 1577, in find_class
    __import__(module, level=0)
ModuleNotFoundError: No module named 'sklearn.neighbors.kde'

I think this is related to a problem that I reported in a previous issue.
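The last line of the traceback suggests that the pickled model refers to a module path the installed scikit-learn no longer provides. To my knowledge, the KDE code moved from `sklearn.neighbors.kde` to the private module `sklearn.neighbors._kde` around scikit-learn 0.22. Installing the scikit-learn version the model was trained with is probably the cleaner fix. As a stopgap, the sketch below registers the new module under the old name before unpickling. The helper `alias_module` is hypothetical, and loading a pickle across scikit-learn versions may still fail or give different results.

```python
import importlib
import sys


def alias_module(old_name, new_name):
    """Make the module importable as new_name also resolvable as old_name.

    pickle looks classes up by importing the module path stored in the file,
    so an entry in sys.modules under the old path lets the lookup succeed.
    """
    module = importlib.import_module(new_name)
    sys.modules[old_name] = module
    return module


# Assumed usage, to run before the joblib.load call shown in the traceback
# (requires scikit-learn to be installed; the module paths are an assumption):
#   alias_module("sklearn.neighbors.kde", "sklearn.neighbors._kde")
#   kde_unaligned = joblib.load(model_prefix + "_unaligned_length.pkl")
```

The alias only changes how the import is resolved. It does not convert the stored model, so it is best treated as a temporary workaround.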

Thank you, Hamdiye

SaberHQ commented 3 years ago

Thanks @huzuner for reporting these issues. As for the conda installation, I guess it is not up to date. For now, I suggest cloning from GitHub and using the latest committed version. We will take a look at the conda installation and update it as well.

As for the first issue you reported here, I will leave it to @cheny19 to comment on that.

cheny19 commented 3 years ago

Hi @huzuner, sorry for the late reply. I have tried both the pre-trained albacore model and the guppy model, and both of them produced sequences whose lengths match their quality strings. In fact, the pre-trained models do not contain any information about the quality simulation, so in theory they shouldn't affect it. We had the unequal length bug before, but this was resolved before v3.0.0. Could you provide more information about your command, so I can try to reproduce this error?

As for the conda install, I have just updated the requirements.txt file, so it should solve this problem. Please try the latest release and see how that goes. If that does not work, you can also clone the GitHub repo and use conda install --file requirements.txt for the dependencies.