Jigyasa3 commented 7 months ago

Dear @Mister-Teapot and @ZoyavanMeel,

Thank you for a great tool to calculate Z-curves! I am getting an error when I use the ORCA.from_string() function. The NC_000919_Treponema_test.fasta file that I am using contains a multi-line nucleotide sequence (with no ">" header line). I also tried converting this file to a single-line string, but I get the same error. Any suggestions on how to load the FASTA sequence as a string so that ORCA recognizes it?

Code used-

>>> orca = ORCA.from_string("/groups/rubin/projects/jigyasa/ML/results/intergenicregion_find/oriV_annotation/GCskew_plasmidfinder/test/NC_000919_Treponema_test.fasta")
>>> orca.find_oriCs(show_info=True)

Error-

Traceback (most recent call last):
  File "", line 1, in
  File "/home/jigyasaa/downloads/ORCA/build/lib/orcapy/ORCA.py", line 573, in find_oriCs
    peaks_of_interest = self.analyse_disparity_curves()
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jigyasaa/downloads/ORCA/build/lib/orcapy/ORCA.py", line 370, in analyse_disparity_curves
    peaks_x = CurveProcessing.process_curve(self.x, 'min', window_size=window_size)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jigyasaa/miniconda3/envs/orca_conda/lib/python3.11/site-packages/orcapy-1.0.1-py3.11.egg/orcapy/CurveProcessing.py", line 14, in process_curve
    accepted_peaks = filter_peaks(curve, mode, init_peaks)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jigyasaa/miniconda3/envs/orca_conda/lib/python3.11/site-packages/orcapy-1.0.1-py3.11.egg/orcapy/CurveProcessing.py", line 57, in filter_peaks
    rejected_peaks.extend(_filter_within_windows(curve, mode, peaks))
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jigyasaa/miniconda3/envs/orca_conda/lib/python3.11/site-packages/orcapy-1.0.1-py3.11.egg/orcapy/CurveProcessing.py", line 79, in _filter_within_windows
    elif mode == 'min' and comparator_win.min() < curve[peak.middle]:
                           ^^^^^^^^^^^^^^^^^^^^
  File "/home/jigyasaa/miniconda3/envs/orca_conda/lib/python3.11/site-packages/numpy/core/_methods.py", line 45, in _amin
    return umr_minimum(a, axis, None, out, keepdims, initial, where)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: zero-size array to reduction operation minimum which has no identity

ZoyavanMeel commented 7 months ago

Hi @Jigyasa3, there actually isn't a constructor for reading FASTA files at the moment, only for GBK files. The sequence parameter of the from_string constructor should already be the DNA string itself, not a path to a file.

I'll have a look at including a FASTA constructor in the future! But for now there isn't one, sorry.
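Until a FASTA constructor exists, one possible workaround is to read the file yourself and pass the resulting DNA string to from_string. The snippet below is only a minimal sketch under a few assumptions: read_fasta_sequence is a hypothetical helper (not part of ORCA), it keeps only the first record, and the import path in the commented usage lines is assumed rather than confirmed.

```python
from pathlib import Path


def read_fasta_sequence(path):
    """Return the first record of a FASTA file as one uppercase string.

    Hypothetical helper, not part of ORCA. Works for files with or
    without a ">" header line and for single- or multi-line sequences.
    """
    parts = []
    seen_header = False
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if seen_header or parts:
                break  # a second record starts here; stop reading
            seen_header = True
            continue
        parts.append(line.upper())
    return "".join(parts)


# Usage sketch (import path assumed; adjust to your installation):
# from orcapy import ORCA
# sequence = read_fasta_sequence("NC_000919_Treponema_test.fasta")
# orca = ORCA.from_string(sequence)
# orca.find_oriCs(show_info=True)
```

The point is that from_string receives the nucleotides themselves; passing the file path makes ORCA treat the path text as a very short "sequence", which is consistent with the zero-size array error above.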

Jigyasa3 commented 7 months ago

Hi @ZoyavanMeel,

Thank you so much for replying back to all my queries! I will use the ORCA.from_gbk() for my samples!

BTW, I was comparing the ORCA output for the NC_000919 Treponema genome with that of DoriC, but I am getting slightly different results. Any suggestions?

bacteria example-

>>> orca = ORCA.from_accession("NC_000919",email=email)
>>> orca.find_oriCs(show_info=True)

ORCA output: The best-scoring potential oriC was found at: 2611 bp. DoriC output: 1399..1640.

ZoyavanMeel commented 7 months ago

Yeah, so DoriC (Ori-Finder) uses only intergenic regions in its final analysis. That is how it is able to give an actual range, rather than just a single point. The exact details of how it does this are unclear to me. Largely because it is closed source and a web service, I made ORCA as an alternative open-source oriC prediction tool. ORCA considers it possible for the oriC to be anywhere on the sequence, so its predictions will differ slightly from DoriC's. An application note on ORCA by my supervisor and me should be out on bioRxiv in a few days. I can send you the link then :)
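To put a number on "slightly different", one option is to measure how far ORCA's single position lies from the DoriC range on the circular chromosome. This is just an illustrative sketch: distance_to_range is a hypothetical helper, it assumes the range does not wrap around the origin of the coordinate system (start <= end), and the genome length used in the example is a placeholder, not the real length of NC_000919.

```python
def circular_distance(a, b, genome_length):
    """Shortest distance between two positions on a circular sequence."""
    d = abs(a - b)
    return min(d, genome_length - d)


def distance_to_range(pos, start, end, genome_length):
    """Distance in bp from a point to a range on a circular sequence.

    Returns 0 if the point falls inside the range. Assumes start <= end,
    i.e. the range does not span the coordinate origin.
    """
    if start <= pos <= end:
        return 0
    return min(
        circular_distance(pos, start, genome_length),
        circular_distance(pos, end, genome_length),
    )


# Placeholder genome length; substitute the real sequence length.
print(distance_to_range(2611, 1399, 1640, genome_length=1_000_000))
```

For the values in this thread, the ORCA position (2611) lies 2611 - 1640 = 971 bp downstream of the end of the DoriC range.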

ZoyavanMeel commented 7 months ago

Here is the application note: https://doi.org/10.1101/2024.03.28.587133
Let me know if you have any more questions.

ZoyavanMeel / ORCA

Error in ORCA.from_string() #6
