Issue description

Hi all, thanks for the great tool and good documentation. Sorry if I am missing something.

I am running CNVkit to create a reference using a mix of 28 female and male control samples. The problem is that the coverage step is taking too long (more than 21 hours), and I am getting this message:

/usr/local/lib/python3.9/site-packages/skgenome/intersect.py:11: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  from pandas import Int64Index
[W::cram_populate_ref] Creating reference cache directory /root/.cache/hts-ref
This may become large; see the samtools(1) manual page REF_CACHE discussion
Processing reads in CTRL-NEUYA376UJ4-03588-G_1.final.cram

Looking at the processes, it seems to be working, but I don't know why it takes so long. I used the same workflow for BAM files previously and it worked great, but for these CRAMs it is taking too long. I checked target.bed and it doesn't contain alternative contigs. I am not sure what to do next. Can you advise?

Libraries used

Package          Version
biopython        1.79
CNVkit           0.9.9
contourpy        1.0.5
cycler           0.11.0
fonttools        4.37.3
joblib           0.17.0
kiwisolver       1.4.4
matplotlib       3.6.0
networkx         2.8.6
numexpr          2.8.3
numpy            1.23.3
packaging        21.3
pandas           1.5.0
Pillow           9.2.0
pip              22.2.2
pomegranate      0.14.8
pyfaidx          0.7.1
pyparsing        3.0.9
pysam            0.19.1
python-dateutil  2.8.2
pytz             2022.2.1
PyVCF            0.6.8
PyYAML           6.0
reportlab        3.6.11
scikit-learn     1.0.2
scipy            1.9.1
setuptools       49.2.1
six              1.16.0
threadpoolctl    3.1.0
wheel            0.37.1

OS version

NAME="Ubuntu"
VERSION="20.04.5 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.5 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

Best, Luka.

Hi @looxon93,

I do not know CRAM very well, but as far as I can tell, samtools handles the reference for CRAM through environment variables (see last paragraph). It may automatically try to download the correct reference if it is not found (I am not sure about that), so pysam may behave similarly. That could explain your runtime, if we consider that the whole human genome is downloaded at least once (maybe multiple times?).
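One way to test the download hypothesis is to measure the cache directory before and after a short run. The following is a minimal sketch using only the Python standard library. The helper names `dir_size_bytes` and `describe_growth` are made up for illustration, and the directory to watch is the one reported in the `[W::cram_populate_ref]` warning above. The warning itself points to the REF_CACHE discussion in the samtools manual, so the cache location is presumably controlled by that environment variable.

```python
import os
from pathlib import Path


def dir_size_bytes(root):
    """Total size in bytes of all regular files under root (0 if root is missing)."""
    root = Path(root)
    if not root.is_dir():
        return 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += (Path(dirpath) / name).stat().st_size
            except OSError:
                # A file may disappear or be renamed while the cache is being written
                continue
    return total


def describe_growth(before, after):
    """Summarize the change between two size measurements (hypothetical helper)."""
    grown = after - before
    if grown > 0:
        return f"cache grew by {grown / 1e6:.1f} MB"
    return "cache did not grow"
```

For example, record `dir_size_bytes("/root/.cache/hts-ref")` before starting a run on two or three CRAMs and again a few minutes in. Steady growth would suggest that reference sequences are being fetched rather than read from the local FASTA.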

Here is what you can try:

Check that the FASTA you provided to CNVkit is exactly the same one used to generate your CRAM files (compare its contigs to the CRAM header, etc.).
Try building a reference from CRAM again, but on only a few files (e.g. 3). Maybe you can catch this "genome download" situation by watching the /root/.cache/hts-ref directory during the process?
You could also use samtools to download the proper reference and give it as input to CNVkit.
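The first check can be sketched in plain Python, assuming you have saved the CRAM header as text (for instance from `samtools view -H`) and have the FASTA index (`.fai`, as written by `samtools faidx`). The function names below are hypothetical, not part of CNVkit or pysam. The sketch compares contig names and lengths only; matching names and lengths do not prove that the sequences are identical, which is what the M5 (MD5) tags on the `@SQ` lines are for.

```python
def parse_sq_lines(header_text):
    """Map contig name -> length from the @SQ lines of a SAM/CRAM header."""
    contigs = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each field after the record type looks like TAG:value
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if "SN" in fields and "LN" in fields:
            contigs[fields["SN"]] = int(fields["LN"])
    return contigs


def parse_fai(fai_text):
    """Map contig name -> length from a FASTA index (first two columns)."""
    contigs = {}
    for line in fai_text.splitlines():
        cols = line.split("\t")
        if len(cols) >= 2:
            contigs[cols[0]] = int(cols[1])
    return contigs


def compare_contigs(cram_contigs, fasta_contigs):
    """Return (contigs missing from the FASTA, contigs whose lengths differ)."""
    missing = sorted(set(cram_contigs) - set(fasta_contigs))
    mismatched = sorted(
        name
        for name, length in cram_contigs.items()
        if name in fasta_contigs and fasta_contigs[name] != length
    )
    return missing, mismatched
```

If both returned lists are empty, the FASTA at least has the same contig layout as the CRAM files. Any entry in either list means the CRAMs were probably written against a different reference build.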

Hope this helps! Have a nice day, Felix

etal / cnvkit

CNVkit coverage takes too long on CRAM file #759
