KCCG / ClinSV

Robust detection of clinically relevant structural and copy number variation from whole genome sequencing data
Other
61 stars 8 forks source link

IGV hg38 fix #63

Open akiss-me opened 3 months ago

akiss-me commented 3 months ago

Dear all,

I fixed IGV hg38 issue, the solution is not very clean, since I did not rewrite the code, but simply replaced b37 with hg38 and the corresponding tracks. I didn’t make changes to the working image or fork it, so for now I suggest just mounting the corrected version of the main script (please unzip first).

this is how to run:

docker run --rm \
--mount type=bind,source=/**your_path**/clinsv_hg38,target=/app/clinsv/bin/clinsv \
--name clinsv \
-v /your_path/ \
--entrypoint "perl" mrbradley2/clinsv:v1.1-dev /app/clinsv/bin/clinsv \
-r all \
-p "/**your_path**/clinsv_results/" \
-i "/**your_path**/*.bam" \
-ref "/**your_path**/refdata-b38/refdata-b38"

enjoy!

clinsv_hg38.zip

thedam commented 2 months ago

seems like almost all paths in .xml are generated wrongly (that includes the change in the structure of the folders).

Is the ClinSV project abandoned? The biggest pain for me is that needs "chr1, chr2..." naming convention insead of "1, 2, 3..."

akiss-me commented 2 months ago

seems like almost all paths in .xml are generated wrongly (that includes the change in the structure of the folders).

Is the ClinSV project abandoned? The biggest pain for me is that needs "chr1, chr2..." naming convention insead of "1, 2, 3..."

Folder structures have been made for a kind of IGV server that should be run from docker ClinSV (but was cropped from v1.1-dev). I used to solve this problem by running lscr.io/linuxserver/webtop:ubuntu-kde with installed desktop IGV and making a mount point and paths according to paths from ClinSV's XML, and it's working.

J-Bradlee commented 2 months ago

First of, thank you @akiss-me for sharing your fix. I've been working on merging some changes that should hopefully cleanly address the IGV issue. As @akiss-me mentioned, using the v1.1-dev version can handle those naming conventions for you. As for the XML files, if you are using the docker images all the non-resource paths (such as the ref genome), are file paths used in the docker container.

Therefore, if you wanted to use the XML outside of the container in an IGV session you are going to have the change the paths in the XML file to the respective paths as is on your machine. Which are the same paths you mounted onto the docker container with the "-v" commands.

In future I hope to implement this as a script that you can run outside the container.

thedam commented 2 months ago

Hi @J-Bradlee, I'm happy that the project is sitll alive! I have some concerns; looking only at the data in the .xml file generated for hg38, we can see:

<Session genome="b37" hasGeneTrack="true" hasSequenceTrack="true" locus="1:197493130-197514036" version="8"> here should be genome="hg38" as @akiss-me noted.

Then we have for example:

<Resource path="/app/ref-data/refdata-b38/tracks/GRCh37GenomicSuperDup.bed.gz"/> I believe it should be /refdata-b38/tracks/GRCh38GenomicSuperDup.bed.gz

<Resource path="/app/ref-data/refdata-b38/tracks/DGV_1kG_2015-07-18.igv.bed.gz"/> Should it be this file: /refdata-b38/tracks/DGV_GRCh38_hg38_variants_2020-02-25.igv.bed.gz?

<Resource path="/app/ref-data/refdata-b38/tracks/GRCh37_hg19_variants_2015-07-23.igv.bed.gz"/> no such file for hg38?

<Resource path="/app/ref-data/refdata-b38/annotation/Homo_sapiens.GRCh37.75.gff.gz"/> this: /refdata-b38/annotation/Homo_sapiens.GRCh38.99.gff.gz

Are we sure these files are also not misslabelled:

<Resource path="/app/ref-data/refdata-b38/tracks/MGRB-SV.igv.bed.gz"/>
<Resource path="/app/ref-data/refdata-b38/tracks/popCovStdev.bw"/>

.xml is easy to fix anyway. However can we be sure, that in the code and during the data processing, the right paths/files are used? Are the all CNVs calculated on the right coordinates system? At the end, it is the most important...

Anyway, you do a great job, I'm impressed how complex the application is.

ps. do you plan the handle chromosomes "1, 2, 3, 4, MT, X, Y," in hg38 (in addtion to chr1, chr2, chrM...)? I would be super great!

J-Bradlee commented 2 months ago

Hi @thedam,

Those resource paths are like that because you have run clinSV in the docker container. In the container those resource files are at those locations. However, on your own machine they will be whatever you set these environment variables to: refdata_path=$PWD/clinsv/refdata-b38 input_path=$PWD project_folder=$PWD/test_run

Recall, you attached the volumes to the docker container with the "-v" command: docker run -v $refdata_path:/app/ref-data \ -v $project_folder:/app/project_folder \ -v $input_path:/app/input \ --entrypoint "perl" mrbradley2/clinsv:v1.0 /app/clinsv/bin/clinsv \ -r all \ -p /app/project_folder/ \ -i "/app/input/*.bam" \ -ref /app/ref-data/refdata-b38 \

A quick fix would be to run a sed command to on the result XML to remove the "app/ref-data" .

These resource files are used throughout ClinSV accurately .

I have made a feature to handle the hg38 / hg19 chormosomes and is already dockerized for you to use. Albeit, I am currently trying to fix a bug which has broken "coverage by chromosome" part of the output report. See here and here

thedam commented 2 months ago

Yes, I've run the job with the exact Docker command. What I mean, is that in the b38 result .xml file the paths for b37 apperas.

Also the the pipeline doesn't work with GRCh38 chromosome naming convention (1, 2, 3, 4 instead of chr1, chr2 chr3...)

When the input is .bam file with chromosmes 1, 2, 3... the error appers:

### executing: sh /app/project_folder/SVs/joined/lumpy/sh/lumpy.caller.joined.sh &> /app/project_folder/SVs/joined/lumpy/sh/lumpy.caller.joined.e  ...

 ### finished after (hh:mm:ss): 00:00:01
 ### exist status: 256

 ***** error exist status != 0 (256), please check /app/project_folder/SVs/joined/lumpy/sh/lumpy.caller.joined.e for more information

The error appears because somewhere in ClinSV code, there is samtools command with chr: samtools view /app/project_folder/alignments/273858DL/273858DL.bamchr1:1000000-1100000 [main_samview] region "chr1:1000000-1100000" specifies an invalid region or unknown reference. Continue anyway

I quess it could be fixed by providing the reference, that .bam is aligned to and it should be named as Homo_sapiens_assembly38.fasta (as it is hardcoded in clinsv.py, so in the code in the Docker).

All files in refdata-b38 have also naming convention with 'chr1, chr2, chr3' so they should be also changed?

akiss-me commented 2 months ago

Yes, I've run the job with the exact Docker command. What I mean, is that in the b38 result .xml file the paths for b37 apperas.

Also the the pipeline doesn't work with GRCh38 chromosome naming convention (1, 2, 3, 4 instead of chr1, chr2 chr3...)

When the input is .bam file with chromosmes 1, 2, 3... the error appers:

### executing: sh /app/project_folder/SVs/joined/lumpy/sh/lumpy.caller.joined.sh &> /app/project_folder/SVs/joined/lumpy/sh/lumpy.caller.joined.e  ...

 ### finished after (hh:mm:ss): 00:00:01
 ### exist status: 256

 ***** error exist status != 0 (256), please check /app/project_folder/SVs/joined/lumpy/sh/lumpy.caller.joined.e for more information

The error appears because somewhere in ClinSV code, there is samtools command with chr: samtools view /app/project_folder/alignments/273858DL/273858DL.bamchr1:1000000-1100000 [main_samview] region "chr1:1000000-1100000" specifies an invalid region or unknown reference. Continue anyway

I quess it could be fixed by providing the reference, that .bam is aligned to and it should be named as Homo_sapiens_assembly38.fasta (as it is hardcoded in clinsv.py, so in the code in the Docker).

All files in refdata-b38 have also naming convention with 'chr1, chr2, chr3' so they should be also changed?

This is probably because you are using DIY genome reference. It's okay, but for cross compatibility, researchers and medgenetics usually use GATK genome bundle, which comes from the reference with "chr" at the beginning and is kind of standard. You can add chr to the chromosome names in the bam file (https://www.biostars.org/p/13462/), but it could lead to other problems caused by unassembled contigs naming.

thedam commented 2 months ago

Both conventions are used (unfortunately). The main service that doesn't use prefixes is VEP:(

drmjc commented 2 months ago

Hi, I thought v1.1 supported bam files aligned to hg19 and b38 and allowed the contigs to be named 1, 2 3 or chr1, chr2, chr3 etc; James can you investigate that? Most ClinSV ref files were recreated for hg38, at least the large ones that were under our control, like the four different noise files.

v1.0.1 included a fix for the b38 reference files in the igv xml file, which is not in v1.1.

Altering the remaining file paths that otherwise point to locations within a docker container needs another change that I know James is working on now.

So, look forward to v1.1.1 which will bring in all those changes.

Cheers

On Fri, 26 Apr 2024, 1:50 am akiss-me, @.***> wrote:

Yes, I've run the job with the exact Docker command. What I mean, is that in the b38 result .xml file the paths for b37 apperas.

Also the the pipeline doesn't work with GRCh38 chromosome naming convention (1, 2, 3, 4 instead of chr1, chr2 chr3...)

When the input is .bam file with chromosmes 1, 2, 3... the error appers:

executing: sh /app/project_folder/SVs/joined/lumpy/sh/lumpy.caller.joined.sh &> /app/project_folder/SVs/joined/lumpy/sh/lumpy.caller.joined.e ...

finished after (hh:mm:ss): 00:00:01

exist status: 256

***** error exist status != 0 (256), please check /app/project_folder/SVs/joined/lumpy/sh/lumpy.caller.joined.e for more information

The error appears because somewhere in ClinSV code, there is samtools command with chr: samtools view /app/project_folder/alignments/273858DL/273858DL.bam chr1:1000000-1100000 [main_samview] region "chr1:1000000-1100000" specifies an invalid region or unknown reference. Continue anyway

I quess it could be fixed by providing the reference, that .bam is aligned to and it should be named as Homo_sapiens_assembly38.fasta (as it is hardcoded in clinsv.py, so in the code in the Docker).

All files in refdata-b38 have also naming convention with 'chr1, chr2, chr3' so they should be also changed?

This is probably because you are using DIY genome reference. It's okay, but for cross compatibility, researchers and medgenetics usually use GATK genome bundle, which comes from the reference with "chr" at the beginning and is kind of standard. You can add chr to the chromosome names in the bam file (https://www.biostars.org/p/13462/), but it could lead to other problems caused by unassembled contigs naming.

— Reply to this email directly, view it on GitHub https://github.com/KCCG/ClinSV/issues/63#issuecomment-2077619775, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEQQM7LCDX2DU4AIPYM72TY7EQ35AVCNFSM6AAAAABFJRELDSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDANZXGYYTSNZXGU . You are receiving this because you are subscribed to this thread.Message ID: @.***>

J-Bradlee commented 2 months ago

Hi @thedam,

Can I confirm what docker image you are using/what version of ClinSV you are running?

thedam commented 2 months ago

Hi,

I use: mrbradley2/clinsv v1.1-dev **ad470d0fa5d2** 19 months ago 7.41GB

I have it form Docker Hub, docker pull mrbradley2/clinsv:v1.1-dev however I see it is the same image id as here: docker pull containerregistrypubliccb.azurecr.io/clinsv:v1.1-dev

containerregistrypubliccb.azurecr.io/clinsv       v1.1-dev   ad470d0fa5d2   19 months ago   7.41GB
mrbradley2/clinsv                                 v1.1-dev   ad470d0fa5d2   19 months ago   7.41GB