drneavin / Demultiplexing_Doublet_Detecting_Docs

MIT License
14 stars 1 forks source link

Replace sample IDs to header of cluster_genotypes.vcf for Assign_Indiv_by_Geno.R ? #36

Closed MrLocuace closed 5 months ago

MrLocuace commented 9 months ago

Hello @drneavin, I am having a problem running Assign_Indiv_by_Geno.R:

apptainer exec --bind /project/PI/USERS/me/czi/data/scrna/ Demuxafy.sif Assign_Indiv_by_Geno.R -r /project/PI/USERS/me/czi/data/scrna/souporcell/batch1.MAF0.05.vcf.gz -c /project/PI/USERS/me/czi/data/scrna/souporcell/LB-CC-20s-02-LPS-GEX-b1-FC02_souporcell/cluster_genotypes.vcf -o /project/PI/USERS/me/czi/data/scrna/souporcell/LB-CC-20s-02-LPS-GEX-b1-FC02_souporcell

Scanning file to determine attributes. File attributes: meta lines: 33 header_line: 34 variant count: 3804937 column count: 35 Meta line 33 read in. All meta lines processed. gt matrix initialized. Character matrix gt created. Character matrix gt rows: 3804937 Character matrix gt cols: 35 skip: 0 nrows: 3804937 row_num: 0 Processed variant: 3804937 All variants processed Scanning file to determine attributes. File attributes: meta lines: 40 header_line: 41 variant count: 116190 column count: 35 Meta line 40 read in. All meta lines processed. gt matrix initialized. Character matrix gt created. Character matrix gt rows: 116190 Character matrix gt cols: 35 skip: 0 nrows: 116190 row_num: 0 Processed variant: 116190 All variants processed Found GT genotype format in cluster vcf. Will use that metric for cluster correlation. Detected / separator for GT genotype format in cluster vcf Found GT genotype format in reference vcf. Will use that metric for cluster correlation. Detected / separator for GT genotype format in reference vcf Found REF and ALT in both cluster and reference genotype vcfs. Will use chromosome, position, REF and ALT to match SNPs. Joining, by = "ID" Joining, by = "ID" Joining, by = "ID" [1] "AYM-4-071" [1] "0" [1] "1" [1] "2" [1] "3" [1] "4" [1] "5" [1] "6" [1] "7" [1] "8" [1] "9" [1] "10" [1] "11" [1] "12" [1] "13" [1] "14" [1] "15" [1] "16" [1] "17" [1] "18" [1] "19" [1] "20" [1] "21" [1] "22" Error in cor(as.numeric(pull(ref_df, col)), as.numeric(pull(clust_df, : no complete element pairs Calls: pearson_correlation -> cor Execution halted

There should be 26 clusters/individuals/samples. Apparently the program is not matching by sample IDs between reference genotypes and cluster genotypes, since the clusters in cluster_genotypes.vcf are called like:

[1] "0" [1] "1" [1] "2" [1] "3" [1] "4" [1] "5" [1] "6" [1] "7" [1] "8" [1] "9" [1] "10" [1] "11" [1] "12" [1] "13" [1] "14" [1] "15" [1] "16" [1] "17" [1] "18" [1] "19" [1] "20" [1] "21" [1] "22"

What should I do to fix this issue?. Do I need to "manually" replace in the header of cluster_genotypes.vcf file, the "1 2 3 4 5 ..." IDs by the sample IDs from the reference genotype file?. If so, what order of sample IDs do I need to provide?. The same order as in the reference genotype file?

Thank you

drneavin commented 9 months ago

Hi @MrLocuace , this typically means that there aren't any SNPs that are overlapping between the two vcf files. By default, it checks for SNP location and ref and alt alleles to generate unique SNP IDs for each SNP. From the message you posted, it appears that you have location and ref and alt alleles in both files so it's making a SNP ID using all of those parameters. My guess is that the chr encoding for the chromosmes doesn't match between the two files (chr1 in one file and 1 in the other) but it could be a mismatch for the ref and alt alleles as well as something else causing the mismatch. Would you mind sending through a head of each file that includes some of the SNPs?

MrLocuace commented 9 months ago

cluster_genotypes.vcf

fileformat=VCFv4.3

fileDate=05122018_15h52m43s

source=IGSRpipeline

reference=ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

INFO=

FILTER=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT AYM-4-071 CHI-0-060-2 MAP-9-019 QUE-4-038 PER025 PER116 AYM-4-078 CHI-0-059 MAP-9-020 QUE-4-042 PER031 PER110 AYM-4-073 CHI-0-058 MAP-9-017 QUE-4-047 PER030 AYM-4-077 CHI-0-061 MAP-9-016 PER112 PER115 CHI-0-063 PER092 PER088 PER106

reference_genotypes.vcf

fileformat=VCFv4.2

FILTER=

fileDate=20231206

source=PLINKv1.90

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

contig=

INFO=

FORMAT=

INFO=

INFO=

bcftools_viewVersion=1.15.1-21-gf2694e5+htslib-1.15.1-40-g226c1a8

bcftools_viewCommand=view --types snps -q 0.05:minor -m2 -M2 -Oz -o peru139.chile284.MAF0.05.vcf.gz peru139.chile284.vcf; Date=Wed Dec 6 14:01:01 2023

bcftools_viewCommand=view -s AYM-4-071,CHI-0-060-2,MAP-9-019,QUE-4-038,PER025,PER116,AYM-4-078,CHI-0-059,MAP-9-020,QUE-4-042,PER031,PER110,AYM-4-073,CHI-0-058,MAP-9-017,QUE-4-047,PER030,AYM-4-077,CHI-0-061,MAP-9-016,PER112,PER115,CHI-0-063,PER092,PER088,PER106 -Oz -o /project/lbarreiro/USERS/lucas/czi/data/scrna/souporcell/batch1.MAF0.05.vcf.gz peru139.chile284.MAF0.05.vcf.gz; Date=Sat Jan 20 11:49:06 2024

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT AYM-4-071 CHI-0-060-2 MAP-9-019 QUE-4-038 PER025 PER116 AYM-4-078 CHI-0-059 MAP-9-020 QUE-4-042 PER031 PER110 AYM-4-073 CHI-0-058 MAP-9-017 QUE-4-047 PER030 AYM-4-077 CHI-0-061 MAP-9-016 PER112 PER115 CHI-0-063 PER092 PER088 PER106

I get this error:

ESC[0;1;31mSystem has not been booted with systemd as init system (PID 1). Can't operate.ESC[0m ESC[0;1;31mFailed to connect to bus: Host is downESC[0m

caught segfault address 0x30, cause 'memory not mapped'

Traceback: 1: .read_body_gz(file, stats = stats, nrows = nrows, skip = skip, cols = cols, convertNA = as.numeric(convertNA), verbose = as.numeric(verbose)) 2: read.vcfR(args$cluster_vcf) An irrecoverable exception occurred. R is aborting now ... /var/spool/slurm/d/job14100407/slurm_script: line 26: 3339672 Segmentation fault (core dumped) apptainer exec --bind $DIR:/tmp Demuxafy.sif Assign_Indiv_by_Geno.R -r /tmp/souporcell/batch1.MAF0.05.vcf.gz -c /tmp/souporcell/${SAMPLE}_souporcell/cluster_genotypes.vcf -o /tmp/souporcell/${SAMPLE}_souporcell

drneavin commented 9 months ago

Thanks @MrLocuace , the header of each file looks fine. But segfault indicates an error with memory. Have you increased the amount of memory you're using for this job?

MrLocuace commented 9 months ago

@drneavin yes, I increased it up to 140 GB (double than before).

I overlifted the reference genotypes to the same genome build of the cluster genotypes and the script now produces ref_clust_pearson_correlations.tsv, an empty ref_clust_pearson_correlation.png, but no Genotype_ID_key.txt.

Suddently the process stops with this:

ESC[0;1;31mSystem has not been booted with systemd as init system (PID 1). Can't operate.ESC[0m ESC[0;1;31mFailed to connect to bus: Host is downESC[0m Found GT genotype format in cluster vcf. Will use that metric for cluster correlation. Detected / separator for GT genotype format in cluster vcf Found GT genotype format in reference vcf. Will use that metric for cluster correlation. Detected / separator for GT genotype format in reference vcf Found REF and ALT in both cluster and reference genotype vcfs. Will use chromosome, position, REF and ALT to match SNPs. Joining, by = "ID" Joining, by = "ID" Joining, by = "ID" There were 26 warnings (use warnings() to see them) Error in hclust(get_dist(submat, distance), method = method) : NA/NaN/Inf in foreign function call (arg 10) Calls: print ... make_row_cluster -> .local -> make_cluster -> hclust Execution halted

drneavin commented 9 months ago

This is typically happens when there are no snps in common between the two files. Could you send through the first few snps in each file so I can make sure the format looks ok? And how many snps are in each vcf file?

drneavin commented 5 months ago

Hi @MrLocuace , closing this due to inactivity but please reopen if it is either not resolved or with the solution you found to get this to work so other users might be able to reference it in case they run up against the same problem