Illumina / canvas

Canvas - Copy number variant (CNV) calling from DNA sequencing data

FlagUniqueKmers estimated run time for a GRCh38 - 3.1G Genome file #84

Open BenoitFiset opened 6 years ago

BenoitFiset commented 6 years ago

Hi,

I'm setting up files for a trial tumor-normal enrichment Canvas run, and it seems I have to run FlagUniqueKmers. FlagUniqueKmers doesn't seem to be multicore...

After an hour, only this much had completed, with lots of chromosomes still left:

5/9/2018 3:42:16 PM Start
Load FASTA file at Homo_sapiens.GRCh38.dna.primary_assembly.fa, write kmer-flagged output to Homo_GRCh38_kmer.fa
>>>1 1 0 dict 0 incomplete 0
>>>1 1 1000000 dict 542569 incomplete 0
>>>1 1 2000000 dict 1456703 incomplete 0
>>>1 1 3000000 dict 2334153 incomplete 0
>>>1 1 4000000 dict 3311993 incomplete 0
>>>1 1 5000000 dict 4292821 incomplete 0
>>>1 1 6000000 dict 5273936 incomplete 0
>>>1 1 7000000 dict 6234028 incomplete 0
>>>1 1 8000000 dict 7201370 incomplete 0

What run time should I expect for a complete run (22 chromosomes + X + Y + MT) on a GRCh38 file?

Thanks,

B.

eroller commented 6 years ago

We haven't benchmarked this in years, since GRCh38 has been out for a while. We ran it on a high-memory Windows machine and it still took about a day, if I remember correctly. The time could be different on Linux as well. Is there a reason you cannot just download the prebuilt kmer fasta?

BenoitFiset commented 6 years ago

Well... You might make it simple depending on your answer ....

Your kmer.fa is built with the nomenclature ">chr1", ">chr2", ">chrX" ..., and all my files use ">1", ">2", ">X" .... Your kmer.fa also has others, like ">chrEBV", which I don't have.

I'm guessing that the chromosome nomenclature has to match in all the files...

This is the reason I wanted to use FlagUniqueKmers: your kmer.fa is the only file with a different chromosome nomenclature. Even the files from ftp://ftp.ncbi.nih.gov/snp/organisms/ (the other ticket from me) have the ">1", ">2", ">X" nomenclature.

Thanks.

eroller commented 6 years ago

Sigh, I thought NCBI and UCSC had gotten their act together regarding the chromosome naming convention, at least for GRCh38. I guess not. I think UCSC naming is still the preferred convention, though. As a workaround, you could rename the chromosomes in our kmer.fa and genome.fa files, or realign using our genome.fa and update the chromosome names in this file:

ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/common_all_20180418.vcf.gz

or just use the older dbSNP VCF we provide.

I'll leave this ticket open in case we ever revisit FlagUniqueKmers to speed it up.

BenoitFiset commented 6 years ago

So you confirm that all the files have to match in nomenclature?

The file:

ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh38p7/VCF/common_all_20180418.vcf.gz

has the ">1", ">2", ">X" nomenclature. So this leaves only your version of kmer.fa different from all my files.

So I guess it's grep, cut, awk and perl -pe time on the kmer.fa file.

In the meantime, I'll run FlagUniqueKmers on a Linux server with 32 GB of RAM and 24 cores, with a wall time of 48 hrs, to see if it's able to finish....

A thought: can I cut my genome file into parts, run FlagUniqueKmers on these parts and reassemble them afterwards, as a kind of "manual parallel" version of FlagUniqueKmers?

Thanks

eroller commented 6 years ago

Yes, the chromosome names have to match. I think it is OK if the kmer fasta has extra contigs not present in your bam/vcf.

Splitting by chromosome won't work well, because we are checking for kmers that are unique across the entire genome, not just unique within a chromosome.
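
To make the point about genome-wide uniqueness concrete, here is a toy sketch. It is not Canvas code and makes no claim about how FlagUniqueKmers is implemented: the function name `count_unique_kmers`, the k-mer length, and the input file are all hypothetical, and reverse complements are ignored. It counts k-mers that occur once within their own contig versus once across the whole file.

```shell
# Toy illustration only (hypothetical helper, not part of Canvas).
# Usage: count_unique_kmers FASTA K
# Prints how many k-mers occur exactly once within their own contig,
# and how many occur exactly once across all contigs in the file.
count_unique_kmers() {
  awk -v k="$2" '
    # Record every k-mer of one contig, both per contig and genome-wide.
    function add_kmers(name, seq,    i, kmer) {
      for (i = 1; i + k - 1 <= length(seq); i++) {
        kmer = substr(seq, i, k)
        within[name, kmer]++
        genome[kmer]++
      }
    }
    # A header line closes the previous contig and starts a new one.
    /^>/ { if (name != "") add_kmers(name, seq); name = substr($1, 2); seq = ""; next }
    # Sequence lines are concatenated so k-mers can span line breaks.
    { seq = seq toupper($0) }
    END {
      if (name != "") add_kmers(name, seq)
      for (key in within) if (within[key] == 1) w++
      for (kmer in genome) if (genome[kmer] == 1) g++
      printf "unique_within_contig %d\nunique_genome_wide %d\n", w, g
    }
  ' "$1"
}
```

With two contigs `ACGTA` and `ACGTT` and k = 4, every k-mer is unique inside its own contig, but `ACGT` appears in both, so the genome-wide count is lower. A per-chromosome run would see only the first number.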

BenoitFiset commented 6 years ago

Cool Thanks.

I'll do a 48h run of FlagUniqueKmers ... I hope this is enough walltime. At the same time, I will grep, cut, awk and perl -pe the kmer.fa file for my coding pleasure ....

BenoitFiset commented 6 years ago

Hi Eric,

After 48 hrs .... still in Chromosome 1.... at that pace, it will take months !!!

Any tips to speed things up with FlagUniqueKmers ?

5/10/2018 6:11:00 PM Start
Load FASTA file at /ltmp/bfiset/Homo_sapiens.GRCh38.dna.primary_assembly.fa, write kmer-flagged output to /mnt/parallel_scratch_mp2_wipe_on_december_2018/pfiset1/bfiset/Ensembl91/Canvas-Data/Homo_GRCh38_kmer.fa
>>>1 1 0 dict 0 incomplete 0
>>>1 1 1000000 dict 542569 incomplete 0
>>>1 1 2000000 dict 1456703 incomplete 0
>>>1 1 3000000 dict 2334153 incomplete 0
>>>1 1 4000000 dict 3311993 incomplete 0
>>>1 1 5000000 dict 4292821 incomplete 0
>>>1 1 6000000 dict 5273936 incomplete 0
>>>1 1 7000000 dict 6234028 incomplete 0
>>>1 1 8000000 dict 7201370 incomplete 0
>>>1 1 9000000 dict 8148998 incomplete 0
>>>1 1 10000000 dict 9080870 incomplete 0
>>>1 1 11000000 dict 10020072 incomplete 0
>>>1 1 12000000 dict 10958849 incomplete 0
>>>1 1 13000000 dict 11844752 incomplete 0
>>>1 1 14000000 dict 12550030 incomplete 0
>>>1 1 15000000 dict 13513307 incomplete 0
>>>1 1 16000000 dict 14445521 incomplete 0
>>>1 1 17000000 dict 15249629 incomplete 0
>>>1 1 18000000 dict 16202542 incomplete 0
>>>1 1 19000000 dict 17178360 incomplete 0
>>>1 1 20000000 dict 18125537 incomplete 0
>>>1 1 21000000 dict 19076381 incomplete 0
>>>1 1 22000000 dict 19987364 incomplete 0
>>>1 1 23000000 dict 20938453 incomplete 0
>>>1 1 24000000 dict 21851643 incomplete 0
>>>1 1 25000000 dict 22788631 incomplete 0
>>>1 1 26000000 dict 23687764 incomplete 0
>>>1 1 27000000 dict 24591375 incomplete 0
>>>1 1 28000000 dict 25497267 incomplete 0
>>>1 1 29000000 dict 26379358 incomplete 0
>>>1 1 30000000 dict 27333209 incomplete 0
>>>1 1 31000000 dict 28289666 incomplete 0
>>>1 1 32000000 dict 29213350 incomplete 0
>>>1 1 33000000 dict 30106343 incomplete 0
>>>1 1 34000000 dict 31063198 incomplete 0
>>>1 1 35000000 dict 32020753 incomplete 0
>>>1 1 36000000 dict 32935040 incomplete 0
>>>1 1 37000000 dict 33874783 incomplete 0
>>>1 1 38000000 dict 34810117 incomplete 0
>>>1 1 39000000 dict 35756970 incomplete 0
>>>1 1 40000000 dict 36659258 incomplete 0
>>>1 1 41000000 dict 37577969 incomplete 0
>>>1 1 42000000 dict 38538116 incomplete 0
>>>1 1 43000000 dict 39478332 incomplete 0
----------------------------------------
Begin PBS Epilogue Sat May 12 18:10:48 EDT 2018 1526163048

Thanks
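
The "months" figure is consistent with a simple linear extrapolation from the log above: about 43 Mb of chromosome 1 done after roughly 48 hours, against the 3.1 Gb genome size given in the issue title. The sketch below uses a hypothetical helper, `estimate_runtime`, and assumes a constant processing rate, which is likely optimistic if the run slows down as the dictionary grows.

```shell
# Hypothetical helper: linear extrapolation of total run time.
# Usage: estimate_runtime BASES_DONE TOTAL_BASES ELAPSED_HOURS
# Assumes the processing rate stays constant for the whole genome.
estimate_runtime() {
  awk -v done_bases="$1" -v total="$2" -v hours="$3" 'BEGIN {
    est = total / done_bases * hours
    printf "%.0f hours (%.1f days)\n", est, est / 24
  }'
}

# Figures taken from the log and the issue title.
estimate_runtime 43000000 3100000000 48
# prints: 3460 hours (144.2 days)
```

At that rate the full genome would take on the order of five months.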

eroller commented 6 years ago

This might help: https://github.com/Illumina/canvas/issues/48, but otherwise I don't think there will be a quick fix for the slow FlagUniqueKmers. I suspect the runtime on a high RAM windows machine could be substantially different.

Replacing the contig names in the kmer.fa seems like the best workaround.
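
A minimal sketch of that renaming, assuming the only difference is the "chr" prefix plus the mitochondrial contig ("chrM" versus "MT", as in the 22 + X + Y + MT set mentioned earlier in the thread). The function name `strip_chr_prefix` is hypothetical. It only touches header lines, and it does not handle unplaced or alternate contigs whose names differ in other ways; a contig such as chrEBV would simply become EBV.

```shell
# Hypothetical helper: rewrite UCSC-style FASTA headers (>chr1, >chrX, >chrM)
# to the prefix-free style (>1, >X, >MT). Sequence lines are left untouched
# because they never start with ">".
# Usage: strip_chr_prefix INPUT.fa OUTPUT.fa
strip_chr_prefix() {
  sed -E \
    -e 's/^>chrM([[:space:]]|$)/>MT\1/' \
    -e 's/^>chr/>/' \
    "$1" > "$2"
}
```

Usage would look like `strip_chr_prefix kmer.fa kmer.renamed.fa`. Any existing .fai index describes the old names, so it would need to be regenerated afterwards.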

BenoitFiset commented 6 years ago

Wow.... How much RAM would you suggest when you say a high-RAM Windows machine?

Would 32 GB of RAM be my bottleneck? If I go to 256 or 512 GB of RAM, would this help?

Is export COMPlus_gcAllowVeryLargeObjects=1 set by default in the Canvas-1.35.1.1316+master_x64 version that I'm using?

Thanks.

eroller commented 6 years ago

Sorry, I can't say for sure how much memory it uses. If you are caching to disk, then there is not enough RAM.

If you are running the FlagUniqueKmers script, then yes, that setting is set:

FlagUniqueKmers_DIR="$( dirname "$( readlink -f "${BASH_SOURCE[0]}" )" )"
export PATH=/illumina/sync/software/unofficial/Isas/packages/dotnet-1.1.2:$PATH
export COMPlus_gcAllowVeryLargeObjects=1
exec dotnet ${FlagUniqueKmers_DIR}/FlagUniqueKmers.dll "$@"

BenoitFiset commented 6 years ago

I was running FlagUniqueKmers.dll with Mono 5.10 on CentOS. I can't say whether it was caching to disk, as I don't have access to the node when I submit a job.

Is there an equivalent of the FlagUniqueKmers script for Mono? Or do I only need to export COMPlus_gcAllowVeryLargeObjects=1 and run FlagUniqueKmers.dll?

export COMPlus_gcAllowVeryLargeObjects=1
mono ../Canvas/Canvas-1.35.1.1316+master_x64/Tools/FlagUniqueKmers/FlagUniqueKmers.dll

eroller commented 6 years ago

We have seen performance problems running under mono. In that case I don't think it is related to the COMPlus_gcAllowVeryLargeObjects=1 setting.

BenoitFiset commented 6 years ago

Finally did the manual conversion. Does the final output file need to be indexed ?

eroller commented 6 years ago

I don't think so, but it may run faster with the index so might as well generate it. The GenomeSize.xml and genome.fa files would also need to be updated to have the correct contig names.
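
Since the names have to agree across kmer.fa and genome.fa, a quick consistency check after renaming may be worthwhile. The sketch below uses two hypothetical helpers, `contig_names` and `missing_contigs`. It lists contigs present in the first FASTA but absent from the second, which matches the earlier remark that extra contigs in the kmer fasta are probably fine.

```shell
# Hypothetical helpers for checking that contig names agree between files.

# Print the sorted contig names of a FASTA file (first token of each header).
contig_names() {
  awk '/^>/ { print substr($1, 2) }' "$1" | LC_ALL=C sort
}

# Print contigs found in the first FASTA but missing from the second.
# Prints nothing when every contig of the first file is also in the second.
# Usage: missing_contigs genome.fa kmer.fa
missing_contigs() {
  ref_names="$(mktemp)"
  kmer_names="$(mktemp)"
  contig_names "$1" > "$ref_names"
  contig_names "$2" > "$kmer_names"
  LC_ALL=C comm -23 "$ref_names" "$kmer_names"
  rm -f "$ref_names" "$kmer_names"
}
```

Empty output from `missing_contigs genome.fa kmer.fa` means every genome contig has a counterpart in the kmer file. For the index itself, a standard .fai can be produced with `samtools faidx kmer.fa`, assuming samtools is installed.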

sbamin commented 5 years ago

I am using FlagUniqueKmers for the canine genome, ~2.8 Gb in size (38 autosomes, ~3200 scaffolds). When I used the merged fasta, the build time was longer (I killed the job after 2 hours or so, and it was still at chr 1). However, if I parallelize per chromosome (20 jobs at once), all 3,268 chromosome-level sequences (apart from the 38 autosomes, most are tiny scaffolds of a few KB) finished in < 1 hour. The log files look OK, with exit code zero and a "Flagging complete" line. I am using dotnet 2.1 and Canvas-1.39.0.1598+master_x64 on CentOS 7. It does not generate GenomeSize.xml, though. Does it need an extra command, or should I be making that xml manually?

Just making sure that the resulting per-chromosome kmer.fa is valid, given that the jobs finished in < 1 hour.

## for each chromosome
dotnet "${CANVAS_DIR}"/Tools/FlagUniqueKmers/FlagUniqueKmers.dll "${MYCHR_PATH}" "${MYCHR}"_canvas.fa

eroller commented 5 years ago

You need to manually create the xml file.

Running FlagUniqueKmers.dll on each chromosome is not valid since it would not check for kmer uniqueness across contigs (only within each contig).

sbamin commented 5 years ago

Got it and thanks for that pointer.