bcgsc / straglr

Tandem repeat expansion detection or genotyping from long-read alignments
Other
50 stars 9 forks source link

An issue about reproducibility. #16

Closed fjmuzengyiheng closed 8 months ago

fjmuzengyiheng commented 1 year ago

**Hi, @readmanchiu

I am a new fan of the first-tier tool Straglr. I am trying to use this tool to genotype the genome-wide STR/VNTR for my ONT data. I followed your guidance in the "Example application: genome-wide genotyping" part. However, I found an issue about reproducibility.

Here is my steps:**

(1) I download UCSC Simple Repeats track as an tsv file; (2) I deleted the prefix "chr" to be compatible with my genome reference; (3) I also deleted the alternative contigs and kept those STR/VNTR loci in the 22+XY chromosome; (4) I converted downloaded table into bed format using:

grep -v '^#' simple_repeats.downloaded.tsv | awk -vOFS='\t' 'length($17)>1 {print $2,$3,$4,$17}' > simple_repeats.bed

(5) I split whole-genome bed file into batches using:

split -l 10000 -d -a 4 --additional-suffix=.bed simple_repeats.bed simple_repeats_nochr

Then, I run the Genotype mode (locus given):

work_dir=$(pwd) map_ref=/genetics/home/zengyiheng/project/zyh-pipeline/reference/grch38/GRCh38_alignment.fa straglr_sif=/genetics/home/zengyiheng/project/zyh-pipeline/straglr/straglr_blastn_trf.sif str_bed2=/genetics/home/zengyiheng/project/zyh-pipeline/straglr/str_db_ucsc_hg38/nochr/simple_repeats_nochr_0000.bed mkdir ${work_dir}/str mkdir ${work_dir}/tmp

cd ${work_dir}/str singularity exec ${straglr_sif} python /usr/local/bin/straglr.py \ ${work_dir}/bam/19620.hg38.bam \ ${map_ref} \ straglr_scan_19620 \ --loci ${str_bed2} \ --min_str_len 3 \ --max_str_len 100 \ --min_ins_size 100 \ --genotype_in_size \ --min_support 2 \ --max_num_clusters 2 \ --nprocs ${cpu_thread} \ --tmpdir ${work_dir}/tmp

=============

The [simple_repeats_nochr_0000.bed] contains 10000 loci; I run the above command four times respectively setting --nprocs ${cpu_thread} to be 2 / 4 / 48 / 48, and found the results are not the same.

[the --nprocs i used] [the number of loci genotyped successfully recorded in the straglr_scan_19620.bed file ] --nprocs 2 [9719 loci] --nprocs 4 [9808 loci] --nprocs 48 (1st run) [9878 loci] --nprocs 48 (2nd run) [9880 loci]

The output of four runs is uploaded here. ( https://airportal.cn/965275/FmgSnxkGpH )(password: straglr)

It really confused me that the numbers of output loci were not 10000 and different from different runs. Could you please be so kind to help me with this issue? Thank you so much. : )

readmanchiu commented 1 year ago

Hi @fjmuzengyiheng, thanks for reporting this issue. Could you trying the latest code in the repository? I have tried it on my own data using different number of processes, the number of events remain the same. The resulting genotype may differ slightly as Sklearn's GMM clustering might give (mostly slightly) different results occasionally. Also note that you don't have to run with --min_str_len, --max_str_len, and --min_ins_size when you run in genotyping mode i.e. when --loci was provided, those parameters are only applicable when running in genome_scan mode.

fjmuzengyiheng commented 1 year ago

Thank you so much for your reply. It really helped. And I also deleted the --min_str_len, --max_str_len, and --min_ins_size when I run in genotyping mode.

But, another issue bothered me as well. I have about 900,000 loci from the simple_repeats_nochr_22XY.bed (hg38) I split the simple_repeats_nochr_22XY.bed (hg38) into 9 files (100,000 loci per file):

simple_repeats_nochr_22XY_0000.bed simple_repeats_nochr_22XY_0001.bed .... simple_repeats_nochr_22XY_0008.bed

!/bin/bash

SBATCH -J zyh-mono0000

SBATCH -p normal

SBATCH -N 1

SBATCH --output=log.%j.out

SBATCH --error=log.%j.err

#SBATCH --cpus-per-task=48

SBATCH --nodelist=node2

ulimit -HSn 1002400 ulimit -s unlimited ulimit -l unlimited

singularity exec ${straglr_sif} python /usr/local/bin/straglr.py \ ${work_dir}/bam/19620.hg38.bam \ ${map_ref} \ straglr_scan_19620 \ --loci simple_repeats_nochr_22XY_0000.bed \ --genotype_in_size \ --min_support 2 \ --max_num_clusters 2 \ --nprocs 48 \ --tmpdir ${work_dir}/tmp

call TR_start: 2022-08-07 16:27:29 call TR_end: 2022-08-08 17:53:19

It took about 25h to genotype 100,000 loci. If I want to genotype all the genome-wide STR/VNTR loci, it would take about 9 days to genotype one sample using 48 CPU.

Did I set the parameter wrong? Can I make it faster?

readmanchiu commented 1 year ago

How deep is your sample sequenced? I suspect it's pretty deep? For a 20-30x sample for 100,000 STRs, it took 1.5h using 32 processors. If it's really deep, like 80-90x, it's gonna take longer as every read at the locus needs to be looked at.

fjmuzengyiheng commented 1 year ago

Hi, my ONT data is a 20-30x sample. If I selected loci with small size (eg. 3bp to 6bp), it runs faster within 2.5h using --nprocs 70. That shall work. Thank you so much for being patient to my issues.

readmanchiu commented 1 year ago

@fjmuzengyiheng, I believe you are using Straglr v1.2. If you want to give the current code in the repository a try, it should be much faster. I am going to make this a public release hopefully later this week, it will be v1.3. Thanks very much for trying Straglr.

fjmuzengyiheng commented 1 year ago

Thank you so much. Looking forward to the new public release v1.3. : - )

fjmuzengyiheng commented 1 year ago

Hi, Prof.Readman Chiu.

Sorry for bother again. I am using this first-tier tool for STR counting for my neurogenetic patients.

Here is one issue I want to report:

There is one disease named "CANVAS (Cerebellar ataxia, neuropathy, and vestibular areflexia syndrome)", which is caused by an expansion of (AAGGG)n repeat in RFC1 gene. (https://omim.org/entry/102579?search=RFC1&highlight=rfc1)

The sticky situation lies in:

The reference sequence is (AAAAG)n for this loci (hg38, 4:39,348,424-39,348,483). There is an (AAGGG)n expansion in my data (ONT), which is confirmed by visualizing manually by IGV. (attached below) When I use the bed file below, Straglr outputs nothing.

【bed file】 4 39348424 39348483 AAAAG"

I am not willing to give up this tool for its outstanding performance. Can Straglr deal with this "complex STRs (changed motif situation)"? I will be pleased if Straglr can deal with this situation, which will make it SUPER perfect. Thank you again. This issue is also raised in Github (https://github.com/bcgsc/straglr/issues/19).

[attached file1: IGV visualization for my patient's ONT data]

[attached file2: 3510bp insertion in the first line]

AGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAAGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGCGGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGGAAGGGAAGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGCAATACAGAAGAAGAAGTAATACAGAAGGAAGGAAGGAAGGGAAGGGAAGGAAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGAAGGGAAGGGAAGGCAAGGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAAGGAAGGAAGGGAAGGGAAGGGAAGGGAGGAAGGGAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGCGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGGAAGGAAGGGAAGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAGGGAAGGAAGGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGGAAGGAAGGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGAAGGAAGGAAGGAAGGAAGGGAAGGGAAGGAAGGGAAGGAAGGAAGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGGAAGGG

-----原始邮件----- 发件人:readmanchiu @.> 发送时间:2022-08-10 03:27:50 (星期三) 收件人: bcgsc/straglr @.> 抄送: fjmuzengyiheng @.>, Mention @.> 主题: Re: [bcgsc/straglr] An issue about reproducibility. (Issue #16)

@fjmuzengyiheng, I believe you are using Straglr v1.2. If you want to give the current code in the repository a try, it should be much faster. I am going to make this a public release hopefully later this week, it will be v1.3. Thanks very much for trying Straglr.

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: @.***>

readmanchiu commented 8 months ago

Issue followed up through private communication