Division by zero and other edge case errors

baraaorabi commented 5 years ago

There are a couple of bugs I encountered, mostly of the same nature:

No unmapped reads means that read_analysis.py does not generate a training_unaligned_length.pkl file, which breaks the simulator.py script:


Traceback (most recent call last):
  File "extern/nanosim/src/simulator.py", line 739, in <module>
    main()
  File "extern/nanosim/src/simulator.py", line 723, in main
    read_profile(number, model_prefix, perfect)
  File "extern/nanosim/src/simulator.py", line 167, in read_profile
    kde_unaligned = joblib.load(model_prefix + "_unaligned_length.pkl")
  File "/home/borabi/anaconda3/lib/python3.6/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 570, in load
    with open(filename, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: 'genes/E2F1/P000R001/training_unaligned_length.pkl'

- No head soft clipping
- No mismatch/ins/del in a read

Example of the last:
```shell
2019-04-03 16:49:11: match and error models
Traceback (most recent call last):
  File "extern/nanosim/src/read_analysis.py", line 190, in <module>
    main(sys.argv[1:])
  File "extern/nanosim/src/read_analysis.py", line 180, in main
    error_model.hist(prefix, file_extension)
  File "extern/nanosim/src/besthit_to_histogram.py", line 383, in hist
    out_error_rate.write("Mismatch rate:\t" + str(total_mis * 1.0 / (total_mis + total_match + total_del)) + '\n')
ZeroDivisionError: float division by zero
```

cheny19 commented 5 years ago

Thanks for reporting these errors, we will fix them and push the next release soon.

About the no head soft clipping and error-free reads: NanoSim combines the information from all reads in a library rather than working read by read. So are all of the reads in your training set perfect? If they are experimental reads, these errors should not occur.

baraaorabi commented 5 years ago

Yes, I agree that these division-by-zero errors are almost impossible in experimental data. But I was feeding nanosim some handmade reads purely for development purposes (reads with indels but no mismatches, or vice versa), and that's when nanosim crashed.
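
For illustration, the kind of guard that avoids this crash might look like the sketch below. It reuses the variable names from the traceback (`total_mis`, `total_match`, `total_del`); the helper `safe_rate` is hypothetical and is not part of NanoSim, and reporting 0.0 for an empty denominator is an assumption rather than the project's chosen behaviour.

```python
def safe_rate(count, total):
    """Return count / total, or 0.0 when total is zero.

    Hypothetical helper, not NanoSim code. Returning 0.0 for an empty
    denominator is an assumption; raising a clearer error is another option.
    """
    return count / total if total else 0.0


# Handmade input where no mismatches, matches, or deletions were counted.
total_mis, total_match, total_del = 0, 0, 0
denominator = total_mis + total_match + total_del

# Same output format as the line shown in the traceback, minus the crash.
line = "Mismatch rate:\t" + str(safe_rate(total_mis, denominator)) + "\n"
```

The same pattern would apply to any other rate computed from aggregated counts that can all be zero for synthetic input.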

cheny19 commented 5 years ago

I see. I agree that NanoSim should be more robust and should be able to handle these edge cases. We are working on the next release, so stay tuned!

baraaorabi commented 5 years ago

Awesome. Thanks for the reply.

nick-youngblut commented 5 years ago

I'm getting what seems like the same error with the most up-to-date version of nanosim on bioconda (v2.2.0):

/ebio/abt3_projects/software/dev/llga-sim/.snakemake/conda/73f15e23/lib/python3.7/site-packages/sklearn/externals/joblib/__init__.py:15: DeprecationWarning: sklearn.externals.joblib is deprecated in 0.21 and will be removed in 0.23. Please import this functionality directly from joblib, which can be installed with: pip install joblib. If this warning is raised when loading pickled models, you may need to re-serialize those models with scikit-learn 0.21+.
  warnings.warn(msg, category=DeprecationWarning)
Traceback (most recent call last):
  File "/ebio/abt3_projects/software/dev/llga-sim/.snakemake/conda/73f15e23/bin/simulator.py", line 737, in <module>
    main()
  File "/ebio/abt3_projects/software/dev/llga-sim/.snakemake/conda/73f15e23/bin/simulator.py", line 721, in main
    read_profile(number, model_prefix, perfect)
  File "/ebio/abt3_projects/software/dev/llga-sim/.snakemake/conda/73f15e23/bin/simulator.py", line 167, in read_profile
    kde_unaligned = joblib.load(model_prefix + "_unaligned_length.pkl")
  File "/ebio/abt3_projects/software/dev/llga-sim/.snakemake/conda/73f15e23/lib/python3.7/site-packages/joblib/numpy_pickle.py", line 590, in load
    with open(filename, 'rb') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/ebio/abt3_projects/databases_no-backup/nanosim/R9/2D/ecoli_unaligned_length.pkl'
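
Until the profile files exist, a quick pre-flight check can turn this traceback into a readable message. The sketch below is only an illustration: `missing_profile_files` is a hypothetical helper, the prefix `"training"` is a placeholder, and the only suffix taken from this thread is `_unaligned_length.pkl`; a real check would need the full list of files that simulator.py loads.

```python
import os


def missing_profile_files(model_prefix, suffixes):
    """Return the expected profile paths that do not exist on disk.

    Hypothetical helper, not part of NanoSim.
    """
    paths = [model_prefix + suffix for suffix in suffixes]
    return [path for path in paths if not os.path.isfile(path)]


# "_unaligned_length.pkl" is the suffix from the traceback above;
# "training" is a placeholder model prefix.
missing = missing_profile_files("training", ["_unaligned_length.pkl"])
if missing:
    print("Missing profile files: " + ", ".join(missing))
```

Running a check like this before the simulation step makes it clear whether the training step skipped a file or the model prefix is simply wrong.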

cheny19 commented 5 years ago

Hi @nick-youngblut

Could you try our latest version, the V2.3.0 pre-release? It can be downloaded from the GitHub releases page. No installation is required.

Thanks, Chen

nick-youngblut commented 5 years ago

Thanks for the quick response! I tried v2.3.0, and I'm getting an error stating that the file "ecoli_strandness_rate" doesn't exist. I'm guessing that I need a new version of the "R9, 2D, ecoli" model, but I'm not sure where to get it from. There's nothing in the README.md that I can find about that. Do I in fact need an updated version of the model? If yes, where can I obtain it?

cheny19 commented 5 years ago

You will have to train your model in this case, because the profiles are not totally compatible with the new version. You can run read_analysis.py -h to learn more about how to train your model.

SaberHQ commented 4 years ago

Dear @nick-youngblut, we provide a comprehensive README file that explains how to train NanoSim with any data. It learns the characteristics of the input reads very quickly, and you can then simulate reads based on those profiles.

Please try the latest versions and let me know if I can be of more help. I am closing this issue for now.

bcgsc / NanoSim

Division by zero and other edge case errors #57