BGSpiegl / GCparagon

commandline tool for fast computation and correction of GC biases in WGS DNA datasets from liquid biopsy samples taking the fragment length into account
MIT License

File "/opt/conda/envs/GCparagon/lib/python3.10/site-packages/GCparagon/correct_GC_bias.py", line 3069, in main raise AttributeError(f"2bit reference genome file '{two_bit_reference_file}' does not exist!") #11

Closed mariaalexandrastanciu closed 1 week ago

mariaalexandrastanciu commented 2 weeks ago

Hi,

I am trying to run GCparagon with singularity and I get the following error:

Traceback (most recent call last):
  File "/opt/conda/envs/GCparagon/bin/gcparagon", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/GCparagon/lib/python3.10/site-packages/GCparagon/correct_GC_bias.py", line 3069, in main
    raise AttributeError(f"2bit reference genome file '{two_bit_reference_file}' does not exist!")
AttributeError: 2bit reference genome file 'resources/hg38/gcpropagon' does not exist!

My call:

singularity run gcparagon.sif -b healthy_sWGS/bams/Genome-IJB-HP-10-xx_S81.markDup.bam -rtb resources/hg38/gcpropagon/hg38_analysisSet.2bit -c resources/hg38/gcpropagon/hg38_minimalExclusionListOverlap_1Mbp_intervals_33pcOverlapLimited.FGCD.bed -rgcd resources/hg38/gcpropagon/hg38_reference_GC_content_distribution.tsv -rgb hg38

I have the file in the specified location for sure:

[user@lm4-f001 ~]$ ls /globalscratch/ulb/bctr/astanciu/resources/hg38/gcpropagon/hg38_analysisSet.2bit
/globalscratch/ulb/bctr/astanciu/resources/hg38/gcpropagon/hg38_analysisSet.2bit

I have created the singularity image using the def script you provided.

Can you help me with this issue?

Thanks, Alexandra

BGSpiegl commented 2 weeks ago

Hi Alexandra,

I am sorry that you are experiencing problems with the GCparagon singularity image. I will try to test it locally to understand it better. Do you get the same error (file not found) if you don't specify the paths to the reference files (i.e., none of what you provided above except for -b healthy_sWGS/bams/Genome-IJB-HP-10-xx_S81.markDup.bam, since -rgb hg38 is the default)?

BR Benjamin

mariaalexandrastanciu commented 2 weeks ago

The error message is different, but the problem is the same. It seems that it still cannot find the file, but it looks for it in the default location.

See below:

$ singularity run gcparagon.sif -b healthy_sWGS/bams/Genome-IJB-HP-10-xx_S81.markDup.bam
cannot proceed - no two-bit reference file defined and default expected file not present under {'hg38': PosixPath('/opt/conda/envs/GCparagon/lib/python3.10/site-packages/GCparagon/2bit_reference/hg38.analysisSet.2bit'), 'hg19': PosixPath('/opt/conda/envs/GCparagon/lib/python3.10/site-packages/GCparagon/2bit_reference/hg19.2bit')}. Terminating ..

BGSpiegl commented 2 weeks ago

I guess you built the singularity image yourself from the current repo (is this correct?). Have you tried downloading and using the pre-built singularity image instead?

singularity pull --arch amd64 library://bgspiegl/gcparagon/gcparagon-ubuntu-22_04-container:latest && singularity verify gcparagon-ubuntu-22_04-container_latest.sif

Unfortunately, I haven't tested the latest changes with singularity.

mariaalexandrastanciu commented 2 weeks ago

Yes, I built the image myself from the current repo.

If I try the pre-built image I get the following error:

singularity pull --arch amd64 library://bgspiegl/gcparagon/gcparagon-ubuntu-22_04-container:latest
FATAL: Unable to get library client configuration: remote has no library client (see https://apptainer.org/docs/user/latest/endpoint.html#no-default-remote)

BGSpiegl commented 1 week ago

Ah, I totally missed that Singularity was split into Singularity CE and Apptainer after Gregory Kurtzer left Sylabs in 2020. You should be able to fix the problem by running the commands provided under the URL in your last message:

apptainer remote add --no-login SylabsCloud cloud.sycloud.io
apptainer remote use SylabsCloud
apptainer remote list

The last command should list 'SylabsCloud' among your available remotes. The command singularity remote list should also list SylabsCloud. What does it show for you? After that, the minimal pull command should work:

singularity pull library://bgspiegl/gcparagon/gcparagon-ubuntu-22_04-container:latest

I need more information to give you any reasonable support here. I tested the simplest pull command singularity pull library://bgspiegl/gcparagon/gcparagon-ubuntu-22_04-container:latest on another machine and it worked fine. Which OS are you on? What is the version of your singularity? (Run singularity --version from within the GCparagon conda env; the singularity in my GCparagon conda env is version 3.8.6.) [edited - Apptainer/SingularityCE split]

BGSpiegl commented 1 week ago

Concerning the file-not-found problem: not every location on the host is accessible from inside the singularity container. You would have to use paths under, e.g., $HOME for Apptainer/SingularityCE to actually be able to find the files (I am also a friend of absolute paths here). If you want to use another directory permanently for GCparagon, you can always mount it as described in the Apptainer docs: https://apptainer.org/docs/user/latest/quick_start.html#working-with-files
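
A minimal Python sketch of the check described above, assuming the container only sees the user's home directory, /tmp, and the current working directory by default (the actual default bind set depends on how Apptainer/SingularityCE is configured on your system). The helper name dirs_to_bind is hypothetical and not part of GCparagon; it returns the parent directories of input files that would have to be passed via -B.

```python
from pathlib import Path


def dirs_to_bind(file_paths, visible_dirs):
    """Return parent directories of file_paths that lie outside all visible_dirs.

    Hypothetical helper: paths are made absolute and compared lexically
    (symlinks are not resolved).
    """
    visible = [Path(d).absolute() for d in visible_dirs]
    needed = []
    for file_path in file_paths:
        parent = Path(file_path).absolute().parent
        # A directory is visible if it is, or lies below, a visible directory.
        if any(parent == v or v in parent.parents for v in visible):
            continue
        if parent not in needed:
            needed.append(parent)
    return needed


if __name__ == "__main__":
    # Assumed default-visible directories; adjust to your cluster's configuration.
    assumed_visible = [Path.home(), Path("/tmp"), Path.cwd()]
    bam = "/globalscratch/ulb/bctr/astanciu/healthy_sWGS/bams/Genome-IJB-HP-10-xx_S81.markDup.bam"
    for directory in dirs_to_bind([bam], assumed_visible):
        print(f"-B {directory}")
```

Any directory printed by this sketch is a candidate for an explicit bind mount; directories already under an assumed-visible location are skipped.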

Example of the 2bit reference files in the singularity image: For now, the hg38 2bit reference genome file is downloaded to /opt/github/GCparagon/src/GCparagon/2bit_reference/hg38.analysisSet.2bit but the program expects to find it under /opt/conda/envs/GCparagon/lib/python3.10/site-packages/GCparagon/2bit_reference/hg38.analysisSet.2bit after pip install. You should be able to get the anticipated behaviour by passing the absolute paths under /opt/github/GCparagon/src/GCparagon/ where the files are actually located. I will look into how to get a consistent behaviour irrespective of whether the user used the pip install command (also in case of using the singularity image) or not.
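
The mismatch described above (file downloaded into the repository checkout, program looking under site-packages) can be illustrated with a small fallback lookup. This is a sketch only and not GCparagon's actual code: find_first_existing is a hypothetical helper, and the two candidate paths are the ones quoted in this thread.

```python
from pathlib import Path


def find_first_existing(candidates):
    """Return the first candidate that is an existing file, or None if none exists."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None


if __name__ == "__main__":
    # Both locations are taken from the messages in this thread.
    candidates = [
        "/opt/conda/envs/GCparagon/lib/python3.10/site-packages/GCparagon/2bit_reference/hg38.analysisSet.2bit",
        "/opt/github/GCparagon/src/GCparagon/2bit_reference/hg38.analysisSet.2bit",
    ]
    found = find_first_existing(candidates)
    if found is None:
        print("no 2bit reference file found in any candidate location")
    else:
        # Print the parameter that could be passed explicitly to gcparagon.
        print(f"--two-bit-reference-genome {found}")
```

The lookup has to run where the files actually live (here: inside the container), because the /opt/... paths do not exist on the host.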

mariaalexandrastanciu commented 1 week ago

Hi,

After running the apptainer commands you suggested I get the following error:

[astanciu@lm4-f001 ~]$ singularity pull --arch amd64 library://bgspiegl/gcparagon/gcparagon-ubuntu-22_04-container:latest
FATAL: While pulling library image: error fetching image: error making request to server: Get "https://library.sylabs.io/v1/images/bgspiegl/gcparagon/gcparagon-ubuntu-22_04-container:latest?arch=amd64": dial tcp: lookup library.sylabs.io on 192.168.254.91:53: no such host

And this is the list I get:

[astanciu@lm4-f001 ~]$ singularity remote list
NAME URI DEFAULT? GLOBAL? EXCLUSIVE? SECURE?
DefaultRemote cloud.apptainer.org ✓ ✓
SylabsCloud cloud.sycloud.io ✓ ✓

I am working on a Linux OS cluster and my singularity version is: apptainer version 1.3.0-1.el8

mariaalexandrastanciu commented 1 week ago

Hi,

I eventually managed to download the image, but I get the same issue:

singularity run gcparagon-ubuntu-22_04-container_latest.sif -b /healthy_sWGS/bams/Genome-IJB-HP-10-xx_S81.markDup.bam
cannot proceed - no two-bit reference file defined and default expected file not present under {'hg38': PosixPath('/opt/conda/envs/GCparagon/lib/python3.10/site-packages/GCparagon/2bit_reference/hg38.analysisSet.2bit'), 'hg19': PosixPath('/opt/conda/envs/GCparagon/lib/python3.10/site-packages/GCparagon/2bit_reference/hg19.2bit')}. Terminating ..

BGSpiegl commented 1 week ago

Does it find the required hg38 files if you specify them like this:

--two-bit-reference-genome /opt/github/GCparagon/src/GCparagon/2bit_reference/hg38.analysisSet.2bit --intervals-bed /opt/github/GCparagon/src/GCparagon/accessory_files/hg38_minimalExclusionListOverlap_1Mbp_intervals_33pcOverlapLimited.FGCD.bed --reference-gc-content-distribution-table /opt/github/GCparagon/src/GCparagon/accessory_files/accessory_files/hg38_reference_GC_content_distribution.tsv

[Edited: exchanged --exclude-intervals with --intervals-bed]

mariaalexandrastanciu commented 1 week ago

It seems that it moved on to the next file:

singularity run gcparagon.sif -b /globalscratch/ulb/bctr/astanciu/healthy_sWGS/bams/Genome-IJB-HP-10-xx_S81.markDup.bam --two-bit-reference-genome /opt/github/GCparagon/src/GCparagon/2bit_reference/hg38.analysisSet.2bit --exclude-intervals /opt/github/GCparagon/src/GCparagon/accessory_files/hg38_minimalExclusionListOverlap_1Mbp_intervals_33pcOverlapLimited.FGCD.bed --reference-gc-content-distribution-table /opt/github/GCparagon/src/GCparagon/accessory_files/accessory_files/hg38_reference_GC_content_distribution.tsv
cannot proceed - no genomic intervals BED file defined and default expected file not present under /opt/conda/envs/GCparagon/lib/python3.10/accessory_files/hg38_minimalExclusionListOverlap_1Mbp_intervals_33pcOverlapLimited.FGCD.bed. Terminating ..

BGSpiegl commented 1 week ago

I am sorry, I made yet another mistake (it is --intervals-bed, not --exclude-intervals). Please try these fixed parameters:

--two-bit-reference-genome /opt/github/GCparagon/src/GCparagon/2bit_reference/hg38.analysisSet.2bit --intervals-bed /opt/github/GCparagon/src/GCparagon/accessory_files/hg38_minimalExclusionListOverlap_1Mbp_intervals_33pcOverlapLimited.FGCD.bed --reference-gc-content-distribution-table /opt/github/GCparagon/src/GCparagon/accessory_files/hg38_reference_GC_content_distribution.tsv

[EDIT: yet another mistake - this time in the path]

mariaalexandrastanciu commented 1 week ago

I ran this:

singularity run gcparagon-ubuntu-22_04-container_latest.sif -b /globalscratch/ulb/bctr/astanciu/healthy_sWGS/bams/Genome-IJB-HP-10-xx_S81.markDup.bam --two-bit-reference-genome /opt/github/GCparagon/src/GCparagon/2bit_reference/hg38.analysisSet.2bit --intervals-bed /opt/github/GCparagon/src/GCparagon/accessory_files/hg38_minimalExclusionListOverlap_1Mbp_intervals_33pcOverlapLimited.FGCD.bed --reference-gc-content-distribution-table /opt/github/GCparagon/src/GCparagon/accessory_files/hg38_reference_GC_content_distribution.tsv

Error:

Traceback (most recent call last):
  File "/opt/conda/envs/GCparagon/bin/gcparagon", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/GCparagon/lib/python3.10/site-packages/GCparagon/correct_GC_bias.py", line 3060, in main
    raise FileNotFoundError(f"the reference GC content distribution file could not be found/accessed!")
FileNotFoundError: the reference GC content distribution file could not be found/accessed!

BGSpiegl commented 1 week ago

I made another mistake, this time in the default path I assumed to be correct. Please try:

--two-bit-reference-genome /opt/github/GCparagon/src/GCparagon/2bit_reference/hg38.analysisSet.2bit --intervals-bed /opt/github/GCparagon/accessory_files/hg38_minimalExclusionListOverlap_1Mbp_intervals_33pcOverlapLimited.FGCD.bed --reference-gc-content-distribution-table /opt/github/GCparagon/accessory_files/hg38_reference_GC_content_distribution.tsv

I am currently also testing this locally. Sorry for the series of mistakes. I am planning to change the singularity.def file, create another image, and make it available soon.

[Edit: the accessory_files dir is in the outer GCparagon directory; paths changed for the --intervals-bed and --reference-gc-content-distribution-table parameters]

mariaalexandrastanciu commented 1 week ago

I have the same error:

singularity run gcparagon-ubuntu-22_04-container_latest.sif -b /globalscratch/ulb/bctr/astanciu/healthy_sWGS/bams/Genome-IJB-HP-10-xx_S81.markDup.bam --two-bit-reference-genome /opt/github/GCparagon/src/GCparagon/2bit_reference/hg38.analysisSet.2bit --intervals-bed /opt/github/GCparagon/src/GCparagon/accessory_files/hg38_minimalExclusionListOverlap_1Mbp_intervals_33pcOverlapLimited.FGCD.bed --reference-gc-content-distribution-table /opt/github/GCparagon/src/GCparagon/accessory_files/hg38_reference_GC_content_distribution.tsv

Traceback (most recent call last):
  File "/opt/conda/envs/GCparagon/bin/gcparagon", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/GCparagon/lib/python3.10/site-packages/GCparagon/correct_GC_bias.py", line 3060, in main
    raise FileNotFoundError(f"the reference GC content distribution file could not be found/accessed!")
FileNotFoundError: the reference GC content distribution file could not be found/accessed!

BGSpiegl commented 1 week ago

This is very odd. I am now trying to create a new image from the current code (tested successfully in script form). I am sorry for this persisting problem. An alternative would be to create the conda environment from the YAML file and run GCparagon simply as a python3 script. This, of course, can also fail if the conda environment can't be created for some dubious reason.

I will get back to you when I know more.

mariaalexandrastanciu commented 1 week ago

I would have tried with conda, but I am working on a cluster and I don't have this option unfortunately.
Thank you for your help!

BGSpiegl commented 1 week ago

Hi Maria.

Thank you once again for bringing up this issue and helping me make GCparagon a more convenient tool. I think commit 93aa29c solved your issue. Please also have a look at release v0.6.8 and the updated README.md. The new singularity_definition_file/build_and_test_singularity_image.sh shows an example of running GCparagon from the container and binding/mounting a directory that is inaccessible to any singularity container by default (the -B parameter).
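
For illustration, a call that binds an extra directory with -B could be assembled as in the following Python sketch. It does not reproduce the contents of build_and_test_singularity_image.sh; build_run_command is a hypothetical helper, and the image name, BAM path, and bound directory are example values taken from this thread.

```python
import shlex
from pathlib import Path


def build_run_command(image, bam, bind_dirs=()):
    """Assemble a 'singularity run' argument list that binds bind_dirs into the container."""
    command = ["singularity", "run"]
    for directory in bind_dirs:
        # One -B option per host directory that should be visible in the container.
        command += ["-B", str(directory)]
    command += [str(image), "--bam", str(bam)]
    return command


if __name__ == "__main__":
    bam = Path("/globalscratch/ulb/bctr/astanciu/healthy_sWGS/bams/Genome-IJB-HP-10-xx_S81.markDup.bam")
    cmd = build_run_command(
        "gcparagon_0.6.8-ubuntu-22_04-container_latest.sif",
        bam,
        bind_dirs=["/globalscratch/ulb/bctr/astanciu/"],
    )
    # Print instead of executing so the command can be inspected first.
    print(shlex.join(cmd))
```

Returning the arguments as a list keeps paths with spaces intact if the list is later handed to subprocess.run.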

Please get the new container: singularity pull library://bgspiegl/gcparagon/gcparagon_0.6.8-ubuntu-22_04-container:latest

Does the --reference-genome-build hg38 parameter work now for you with this new container?

The problem of the default file paths breaking persists though for local installations of GCparagon (running pip install .). I will fix this in another release.

BR Benjamin

mariaalexandrastanciu commented 1 week ago

Hi,

I have downloaded the image you suggested and I ran it with the -B parameter:

singularity run -B /globalscratch/ulb/bctr/astanciu/ gcparagon_0.6.8-ubuntu-22_04-container_latest.sif --bam /globalscratch/ulb/bctr/astanciu/healthy_sWGS/bams/Genome-IJB-HP-10-xx_S81.markDup.bam

And it worked.

Thank you.

Best regards, Alexandra