Psy-Fer / buttery-eel

The buttery eel - a slow5 guppy/dorado basecaller wrapper
MIT License
34 stars 2 forks source link

Some weird warning in the multi-proc branch #10

Closed hasindu2008 closed 11 months ago

hasindu2008 commented 1 year ago

An int overflow?

command:

eel  --config dna_r10.4.1_e8.2_400bps_hac
.cfg  --device 'cuda:2,3' -i  /home/hasindu/scratch/nanopolish-train/PGXXHX230143_meth_mousePGXXHX230143_reads.blow5 -q 9 -o  PGXXHX230143.fastq

or

eel  --config dna_r10.4.1_e8.2_400bps_hac_prom.cfg  --device 'cuda:all' -i  /home/hasindu/scratch/na12878_prom_lsk114/PGXXHX230142_reads.blow5 -o  na12878_guppy_6.4.2_hac.fastq --qscore 9

========================================================================== Basecalling

processed reads: 292000Traceback (most recent call last): File "/usr/lib/python3.7/multiprocessing/queues.py", line 242, in _feed send_bytes(obj) File "/usr/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes self._send_bytes(m[offset:offset + size]) File "/usr/lib/python3.7/multiprocessing/connection.py", line 393, in _send_bytes header = struct.pack("!i", n) struct.error: 'i' format requires -2147483648 <= number <= 2147483647 processed reads: 744000Traceback (most recent call last): File "/usr/lib/python3.7/multiprocessing/queues.py", line 242, in _feed send_bytes(obj) File "/usr/lib/python3.7/multiprocessing/connection.py", line 200, in send_bytes self._send_bytes(m[offset:offset + size]) File "/usr/lib/python3.7/multiprocessing/connection.py", line 393, in _send_bytes header = struct.pack("!i", n) struct.error: 'i' format requires -2147483648 <= number <= 2147483647

Psy-Fer commented 1 year ago

Try using python3.8 or higher. I think this is just an issue with the internal multiprocessing library. So updating python would help with that.

I'll increase the minimum python lib requirement too.

hasindu2008 commented 1 year ago

OK, using Python 3.8 solves the problem, but I still cannot believe how a built-in library in Python can have such an issue. But seems you are right, see this https://github.com/python/cpython/pull/10305/files.

I really wish Python 3.6-3.7 is supported as that is the default snake on Ubuntu 18 for instance.

One suggestion is sending <2GB of data to the process queue at a time, but as I don't know how exactly you have implemented, I cannot tell mcuh.

Psy-Fer commented 1 year ago

Yea, that's possible, but a lot more difficult than requiring an updated python. If someone comes to me with a locked in no way out situation they can't use 3.8, then I'll do it.

Also 18.04 lts runs out this year, so better to use 2020 or 2022 lts, which have 3.8 and above.

hasindu2008 commented 1 year ago

I still use 16.04 :D Anyway, I can easily compile Python 3.8 interpreter than installing certain Python packages thanks to C compilers being more thoughtful in the design. However, there could be users who do not have the expertise to do all of these and may rely on older Pythons, especially in institutional clusters where the systems are usually not too up-to-date.

Given this single-process version has a very less number of issues, do you think we can have those two in the same branch for those who just want to execute it without too much need for performance?

Psy-Fer commented 1 year ago

Hmm. I could do something like

python3 setup_sinlge.py install

vs

python setup.py install

where multiproc is the default setup.

I can also put in a switch for when python is less than 3.8 to only use single...or something like that.

Again, I think it's easier to state the requirements in the readme, and direct people to just install the singleproc branch than do this.

hasindu2008 commented 1 year ago

I think just having a setup_single.py separately and explicitly mentioning in the documentation would suffice.