etal / cnvkit

Copy number variant detection from targeted DNA sequencing
http://cnvkit.readthedocs.org

CNVkit coverage takes too long on CRAM file #759

Open looxon93 opened 1 year ago

looxon93 commented 1 year ago

Issue description

Hi all, thanks for the great tool and good documentation, and sorry if I am missing something.

I am running CNVkit to create a reference using a mix of 28 female and male control samples. The problem is that the coverage step is taking too long (more than 21 h). I am getting this message:

/usr/local/lib/python3.9/site-packages/skgenome/intersect.py:11: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  from pandas import Int64Index
[W::cram_populate_ref] Creating reference cache directory /root/.cache/hts-ref
This may become large; see the samtools(1) manual page REF_CACHE discussion
Processing reads in CTRL-NEUYA376UJ4-03588-G_1.final.cram

Looking at the processes, it seems to be working, but I don't know why it takes so long. I used the same workflow for BAM files previously and it worked great, but for these CRAMs it is taking too long. I checked target.bed and it doesn't contain alternative contigs. I am not sure what to do next. Can you advise?

Libraries used

Package         Version
--------------- --------
biopython       1.79
CNVkit          0.9.9
contourpy       1.0.5
cycler          0.11.0
fonttools       4.37.3
joblib          0.17.0
kiwisolver      1.4.4
matplotlib      3.6.0
networkx        2.8.6
numexpr         2.8.3
numpy           1.23.3
packaging       21.3
pandas          1.5.0
Pillow          9.2.0
pip             22.2.2
pomegranate     0.14.8
pyfaidx         0.7.1
pyparsing       3.0.9
pysam           0.19.1
python-dateutil 2.8.2
pytz            2022.2.1
PyVCF           0.6.8
PyYAML          6.0
reportlab       3.6.11
scikit-learn    1.0.2
scipy           1.9.1
setuptools      49.2.1
six             1.16.0
threadpoolctl   3.1.0
wheel           0.37.1

OS version

NAME="Ubuntu"
VERSION="20.04.5 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.5 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

Best, Luka.

tetedange13 commented 1 year ago

Hi @looxon93,

I do not know CRAM well, but as far as I can tell samtools handles the reference for CRAM through environment variables (see last paragraph) => It may automatically try to download the correct reference if it is not found (not sure about that) => So maybe pysam has a similar behaviour? => That could explain your runtime, if we consider that the whole human genome is downloaded at least once (maybe multiple times?)
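To see which reference settings a run would pick up, a small check like the one below may help. It is only a sketch: it reports the `REF_PATH` and `REF_CACHE` environment variables described in the samtools manual page and measures how much data sits in a cache directory. The default of `~/.cache/hts-ref` is an assumption based on the `/root/.cache/hts-ref` path in the log above, and the function names are made up for illustration.

```python
import os
from pathlib import Path


def directory_size_bytes(path):
    """Total size of all regular files under `path` (0 if it does not exist)."""
    root = Path(path)
    if not root.is_dir():
        return 0
    return sum(f.stat().st_size for f in root.rglob("*") if f.is_file())


def describe_cram_reference_settings(environ=None, cache_dir=None):
    """Report the CRAM reference environment variables and the cache size.

    `cache_dir` defaults to ~/.cache/hts-ref, which is an assumption taken
    from the warning in the log; pass another path if your setup differs.
    """
    environ = os.environ if environ is None else environ
    if cache_dir is None:
        cache_dir = Path.home() / ".cache" / "hts-ref"
    return {
        "REF_PATH": environ.get("REF_PATH"),    # None means "not set"
        "REF_CACHE": environ.get("REF_CACHE"),  # None means "not set"
        "cache_dir": str(cache_dir),
        "cache_bytes": directory_size_bytes(cache_dir),
    }


if __name__ == "__main__":
    for key, value in describe_cram_reference_settings().items():
        print(f"{key}: {value}")
```

Running it a few times while the coverage step is working should show whether `cache_bytes` keeps growing, which would be consistent with reference sequences being fetched during the run.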

Here is what you can try :

  1. Check that the FASTA you provided to CNVkit is exactly the same one used to generate your CRAM files (compare its contigs with the CRAM header, etc.)
  2. Try again to build a reference from CRAM, but on only a few files (like 3) => Maybe you could catch this "genome download" situation by watching the /root/.cache/hts-ref dir during the process?
  3. You may also use samtools to download the proper reference and give it as input to CNVkit?
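For step 1, one possible way to compare the contigs is sketched below. It assumes you have saved the CRAM header as text (for example with `samtools view -H`) and have the `.fai` index that `samtools faidx` writes next to the FASTA. The function names and the tiny example data are made up for illustration. Matching names and lengths is a necessary check but not a sufficient one, since the reference lookup for CRAM also relies on the `M5` checksums in the header.

```python
def parse_sq_lines(header_text):
    """Return {contig name: length} from the @SQ lines of a SAM/CRAM header."""
    contigs = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each field after the record type looks like TAG:VALUE.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        contigs[fields["SN"]] = int(fields["LN"])
    return contigs


def parse_fai(fai_text):
    """Return {contig name: length} from the text of a FASTA .fai index."""
    contigs = {}
    for line in fai_text.splitlines():
        if not line.strip():
            continue
        # The first two tab-separated columns are the name and the length.
        name, length = line.split("\t")[:2]
        contigs[name] = int(length)
    return contigs


def compare_contigs(cram_contigs, fasta_contigs):
    """Return (contigs missing from the FASTA, contigs whose lengths differ)."""
    missing = sorted(set(cram_contigs) - set(fasta_contigs))
    mismatched = sorted(
        name
        for name, length in cram_contigs.items()
        if name in fasta_contigs and fasta_contigs[name] != length
    )
    return missing, mismatched


if __name__ == "__main__":
    # Made-up toy data, not taken from a real genome build.
    header = "@HD\tVN:1.6\n@SQ\tSN:chr1\tLN:1000\n@SQ\tSN:chr2\tLN:800\n"
    fai = "chr1\t1000\t6\t60\t61\nchr2\t900\t1030\t60\t61\n"
    print(compare_contigs(parse_sq_lines(header), parse_fai(fai)))
```

If both returned lists are empty, the FASTA at least has the same contig names and lengths as the CRAM header; anything listed is worth a closer look before rerunning CNVkit.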

Hope this helped ! Have a nice day, Felix