CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

Getting UMI-tools to work with pyston (python replacement) #488

Closed jelber2 closed 3 years ago

jelber2 commented 3 years ago

I tried getting UMI-tools 1.1.2 to work with the Python replacement pyston (https://github.com/pyston/pyston), but there seemed to be some snags in dependencies such as pysam. Curious whether any UMI-tools developers have thought about or tried getting UMI-tools to work with pyston.

jelber2 commented 3 years ago

Just an update: for at least one dataset, umi_tools whitelist installed with pyston shows a roughly 3-fold speed-up compared with the same command run from a conda installation (a larger number of reads gave a ~2x speed-up). umi_tools extract showed no real difference. This was on an Ubuntu 20.04 machine with root privileges.

cd ~/bin/
wget https://github.com/pyston/pyston/releases/download/v2.3/pyston_2.3_20.04.deb
sudo dpkg -i pyston_2.3_20.04.deb

cd ~/bin/
git clone https://github.com/CGATOxford/UMI-tools.git
cd ~/bin/UMI-tools
sudo pyston setup.py install

time zcat ../SRR11868220-orig-I1-I2-R1.fastq.gz|head -n 1000000|umi_tools whitelist --extract-method=regex --bc-pattern="(?P<cell_1>.{10})(?P<cell_2>.{10})(?P<umi_1>.{12})" --error-correct-threshold 3 2> whitelist.err 1> whitelist.out
real    0m6.866s
user    0m7.321s
sys     0m1.104s

time zcat ../SRR11868220-orig-I1-I2-R1.fastq.gz|head -n 1000000|umi_tools extract --extract-method=regex --bc-pattern="(?P<cell_1>.{10})(?P<cell_2>.{10})(?P<umi_1>.{12})" --error-correct-cell --whitelist <(grep -v "^#" whitelist.out)  2> extract.err 1>extract.out
real    0m7.188s
user    0m7.670s
sys     0m1.106s

conda activate umi_tools
time zcat ../SRR11868220-orig-I1-I2-R1.fastq.gz|head -n 1000000|umi_tools whitelist --extract-method=regex --bc-pattern="(?P<cell_1>.{10})(?P<cell_2>.{10})(?P<umi_1>.{12})" --error-correct-threshold 3 2> whitelist.err 1> whitelist.out
real    0m19.914s
user    0m20.258s
sys     0m0.516s

time zcat ../SRR11868220-orig-I1-I2-R1.fastq.gz|head -n 1000000|umi_tools extract --extract-method=regex --bc-pattern="(?P<cell_1>.{10})(?P<cell_2>.{10})(?P<umi_1>.{12})" --error-correct-cell --whitelist <(grep -v "^#" whitelist.out)  2> extract.err 1>extract.out
real    0m6.811s
user    0m7.217s
sys     0m0.453s
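
As a quick check on the numbers above, the speed-up can be worked out from the `real` times. This is only a sketch: the helper `parse_time` and the `real_times` dictionary are illustrative names, not part of umi_tools, and the values are simply copied from the `time` output above.

```python
import re


def parse_time(value):
    """Convert a bash `time` value such as '0m6.866s' into seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if match is None:
        raise ValueError(f"unrecognised time value: {value!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


# Wall-clock ("real") times copied from the runs above.
real_times = {
    "whitelist": {"pyston": "0m6.866s", "conda": "0m19.914s"},
    "extract": {"pyston": "0m7.188s", "conda": "0m6.811s"},
}

# Speed-up = conda time / pyston time (values above 1 mean pyston was faster).
speedups = {
    command: parse_time(times["conda"]) / parse_time(times["pyston"])
    for command, times in real_times.items()
}

for command, ratio in speedups.items():
    print(f"{command}: {ratio:.2f}x")
```

For this single run, that works out to roughly 2.9x for whitelist and about 0.95x (no gain) for extract.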

IanSudbery commented 3 years ago

Hi, I've not come across pyston before, so no, we've never really considered porting across. I can imagine that the biggest hurdle will be pysam: although the pyston website claims full compatibility with C extensions (which is what pysam is), it does say that extensions will need recompiling. Unfortunately, pysam is a bit of a pain to compile from source, and it wouldn't surprise me if the auto-compilation doesn't work very well.

Your timing results are as I would expect - or at least, I'm not surprised that whitelist is sped up more than extract. Extract is primarily I/O limited, although it's not the physical disk access that limits it but rather Python's inefficient I/O routines. However, looking at how pyston works, I wouldn't necessarily expect I/O to benefit.
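
One rough way to see how much of the run time is plain line reading is to time a loop that only iterates over the input and does nothing else. The sketch below is an assumption about how one might measure that baseline, not how umi_tools itself reads FASTQ. The function name `time_line_iteration` and the in-memory records are made up for illustration.

```python
import io
import time


def time_line_iteration(stream):
    """Return (line_count, elapsed_seconds) for a loop that only reads lines."""
    start = time.perf_counter()
    n_lines = 0
    for _ in stream:
        n_lines += 1
    elapsed = time.perf_counter() - start
    return n_lines, elapsed


# Hypothetical in-memory FASTQ records standing in for real data:
# 250,000 four-line records give 1,000,000 lines, matching `head -n 1000000`.
fake_fastq = io.StringIO("@read1\nACGT\n+\nIIII\n" * 250_000)

n_lines, elapsed = time_line_iteration(fake_fastq)
print(f"{n_lines} lines read in {elapsed:.3f}s")
```

Passing `sys.stdin` instead of the in-memory buffer would let the same function be fed from `zcat ... | head -n 1000000`. Running it under both interpreters would give a rough idea of whether the read loop itself changes.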