cancerit / BRASS

Breakpoints via assembly - Identifies breaks and attempts to assemble rearrangements in whole genome sequencing data.
GNU Affero General Public License v3.0

centromere telomere file #101

Closed: mjko1210 closed this issue 4 years ago

mjko1210 commented 4 years ago

Hi, I would like to generate 'centromere_and_telomere_coords.txt' for hg38 for use with ClusterSV. I've tried to get this information from the UCSC Table Browser.

I have centromere:

chrom chromStart chromEnd
chr1 122503247 124785432
chr1 122026459 122224535
chr1 122224635 122503147
chr1 124849229 124932724
chr1 124785532 124849129
chr2 92188145 94090557
chr3 91553419 93655574
chr3 90772458 91233586
...

And telomere:

chrom chromStart chromEnd type
chr1 0 10000 telomere
chr1 248946422 248956422 telomere
chr2 0 10000 telomere
chr2 242183529 242193529 telomere
chr3 0 10000 telomere
chr3 198285559 198295559 telomere
chr4 0 10000 telomere
chr4 190204555 190214555 telomere
...

1) For the centromere, do I take the minimum start and maximum end per chromosome? I would also like to know how you set cen in the hg19 file: https://raw.githubusercontent.com/cancerit/ClusterSV/master/hg19_centromere_and_telomere_coords.txt

2) It wasn't clear to me how to get ptel and qtel from the telomere table above. Could you advise which coordinate I should take for ptel and which for qtel? The table reports two intervals per chromosome (2 lines each), and every chromosome has 0 (start) and 10000 (end) as its first interval.

Thanks! MJ
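The two questions above can be sketched in Python. This is only an illustration of the min/max approach the question proposes; this thread does not state the rule the maintainers used for cen, so treat it as an assumption. The function names `collapse_centromeres` and `split_telomeres` are hypothetical. The only convention relied on is that the p-arm telomere sits at the start of the chromosome's coordinates, so the interval beginning at 0 is taken as the p-arm end and the other as the q-arm end.

```python
from collections import defaultdict

def collapse_centromeres(rows):
    """Collapse several (chrom, start, end) rows into one span per chromosome.

    Assumption: min start / max end is the desired summary, as the
    question proposes; the thread does not confirm this rule.
    """
    spans = {}
    for chrom, start, end in rows:
        lo, hi = spans.get(chrom, (start, end))
        spans[chrom] = (min(lo, start), max(hi, end))
    return spans

def split_telomeres(rows):
    """Label the two telomere intervals of each chromosome as ptel and qtel.

    The interval with the smallest start (0 in the UCSC table) is the
    p-arm telomere; the interval with the largest start is the q-arm one.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in rows:
        by_chrom[chrom].append((start, end))
    labelled = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        labelled[chrom] = {"ptel": intervals[0], "qtel": intervals[-1]}
    return labelled

# Rows copied from the tables in the comment above.
centromeres = [
    ("chr1", 122503247, 124785432),
    ("chr1", 122026459, 122224535),
    ("chr1", 122224635, 122503147),
    ("chr1", 124849229, 124932724),
    ("chr1", 124785532, 124849129),
    ("chr2", 92188145, 94090557),
    ("chr3", 91553419, 93655574),
    ("chr3", 90772458, 91233586),
]
telomeres = [
    ("chr1", 0, 10000),
    ("chr1", 248946422, 248956422),
    ("chr2", 0, 10000),
    ("chr2", 242183529, 242193529),
]

cen_spans = collapse_centromeres(centromeres)
tel_labels = split_telomeres(telomeres)
print(cen_spans["chr1"])   # (122026459, 124932724)
print(tel_labels["chr1"])  # {'ptel': (0, 10000), 'qtel': (248946422, 248956422)}
```

Which single coordinate from each interval belongs in the ptel, cen and qtel columns is not settled by this sketch; check the wiki page or the pregenerated files before relying on it.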

keiranmraine commented 4 years ago

Hi,

I assume you are following the documentation in the wiki:

https://github.com/cancerit/BRASS/wiki/Centromere-Telomere-locations

It's not explicitly described but the format assumes:

keiranmraine commented 4 years ago

Alternatively you can pull the pregenerated files from our FTP site:

http://ftp.sanger.ac.uk/pub/cancer/dockstore/human/GRCh38_hla_decoy_ebv/

Full file set under that path:

bwa_idx_GRCh38_hla_decoy_ebv.tar.gz
CNV_SV_ref_GRCh38_hla_decoy_ebv_brass6+.tar.gz
core_ref_GRCh38_hla_decoy_ebv.tar.gz
GRCh38.md5
qcGenotype_GRCh38_hla_decoy_ebv.tar.gz
README.md
SNV_INDEL_ref_GRCh38_hla_decoy_ebv-fragment.tar.gz
VAGrENT_ref_GRCh38_hla_decoy_ebv_ensembl_91.tar.gz

You would want the CNV_SV_* file if you are only interested in BRASS.

mjko1210 commented 4 years ago

Thanks for this! Unfortunately, I don't have access to the FTP site, so I will generate my own. Could you also explain how you define cen in the file (https://raw.githubusercontent.com/cancerit/ClusterSV/master/hg19_centromere_and_telomere_coords.txt)? It's one of the required columns for ClusterSV. Thanks!

keiranmraine commented 4 years ago

How to define the centromere is described above and in the wiki:

https://github.com/cancerit/BRASS/issues/101#issuecomment-653210101

To access the ftp site you need to request a full filepath, not just the root, so:

http://ftp.sanger.ac.uk/pub/cancer/dockstore/human/GRCh38_hla_decoy_ebv/CNV_SV_ref_GRCh38_hla_decoy_ebv_brass6+.tar.gz
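As a minimal sketch of requesting the full file path rather than the directory root, the URL can be built from the directory and file name quoted in this thread and fetched with Python's standard library. The helper name `download_reference` and the local destination path are hypothetical, and the download itself is left commented out because the archive may be large.

```python
from urllib.parse import urljoin
from urllib.request import urlretrieve

# Directory and file name as given in the comments above.
BASE_URL = "http://ftp.sanger.ac.uk/pub/cancer/dockstore/human/GRCh38_hla_decoy_ebv/"
CNV_SV_FILE = "CNV_SV_ref_GRCh38_hla_decoy_ebv_brass6+.tar.gz"

# Join directory and file name so the request targets the file itself,
# not the directory listing.
full_url = urljoin(BASE_URL, CNV_SV_FILE)

def download_reference(url, dest):
    """Fetch `url` to the local path `dest` (hypothetical helper)."""
    path, _headers = urlretrieve(url, dest)
    return path

# Uncomment to perform the download:
# download_reference(full_url, CNV_SV_FILE)
```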