gbnci commented 6 years ago

All of our sequencing data were aligned to the UCSC genome, specifically hg19. We have tried (very hard) to build compatible cache files, but without success. Could someone give me more detailed information to make this work, or let me know where to download such files if they are available? We have tried the Ensembl version of the cache files, but even after replacing the chromosome notation (1, 2, 3, ... with chr1, chr2, etc.), I still failed to annotate my results. Thanks for the help. Yonghong
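For readers attempting the same renaming, here is a minimal Python sketch of converting one cache line from Ensembl-style to UCSC-style contig names. It assumes the cache is tab-separated with the contig in the first column, and that the serialized record carries the contig in `'_chr'` fields (as in the cache excerpt quoted later in this thread). The helper names `to_ucsc` and `rename_cache_line` and the `MT` to `chrM` mapping are illustrative assumptions, not part of VAGrENT.

```python
import re

# Matches the contig field embedded in each serialized record, e.g. '_chr' => '1'.
CHR_FIELD = re.compile(r"('_chr'\s*=>\s*')([^']+)(')")


def to_ucsc(name):
    """Map an Ensembl-style contig name to a UCSC-style one (assumed mapping)."""
    if name.startswith("chr"):
        return name
    return "chrM" if name == "MT" else "chr" + name


def rename_cache_line(line):
    """Rename the contig in column 1 and in every embedded '_chr' field."""
    chrom, sep, rest = line.partition("\t")
    rest = CHR_FIELD.sub(
        lambda m: m.group(1) + to_ucsc(m.group(2)) + m.group(3), rest
    )
    return to_ucsc(chrom) + sep + rest


# Shortened, hypothetical cache line used only to exercise the function.
example = (
    "1\t29553\t31097\tENST00000473358\tMIR1302-10\t712\t"
    "{'_chr' => '1','_minpos' => '29554'}"
)
print(rename_cache_line(example))
```

After rewriting, the file would need to be compressed again with bgzip and re-indexed with tabix, because the tabix index stores the contig names.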

AndyMenzies commented 6 years ago

Hi Yonghong

Which Ensembl release are you trying to generate the cache files against? (the current release would be 93)

Could I also get two other pieces of information to help troubleshoot?

Could you let me see the chromosome 1 line from the fasta index (fa.fai) file for your reference genome?
Could I see a VCF line for one of your variants, also from chromosome 1?

Andy

gbnci commented 6 years ago

Hi Andy, thanks for your prompt response. Regarding the genome: because our sequencing data were generated over many years, for consistency we have always used the UCSC hg19 genome for alignment. I am not sure which Ensembl release that corresponds to. As for my current use of the cache files (downloaded from the link in your paper), I was running BRASS (also from your group). The software created the VCF file and even the bedpe files, but the bedpe files lack proper gene information, which I assume means the cache files are incompatible with the genome I used. The fa file I used for BRASS originally had chromosome notation like:

1

GATCACAGG…….

For use with BRASS, I changed it to the following (I also changed the chromosome notation throughout the downloaded cache files, still with no luck):

chr1

GATCACAGG…….

The fai file looks like:

chr1 249250621 6 60 61

chr2 243199373 253404811 60 61

chr3 198022430 500657513 60 61

chr4 191154276 701980323 60 61

Part of the VCF file generated from BRASS is:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOUR

chr1 5796393 18_1 G [chr9:6585845[G 4 . SVTYPE=BND;MATEID=18_2;BKDIST=-1;OCC=1;TSRDS=HWI-ST1108:3:1107:17141:87527#1,HWI-ST1108:7:1312:8009:16532#1,HWI-ST1108:7:1315:10545:75355#1,HWI-ST1108:7:2206:16268:60423#1;BKDIST=-1;SVCLASS=translocation RC:PS 0:0 0:4

chr1 14352896 34_1 G G[chr1:14358623[ . . SVTYPE=BND;MATEID=34_2;IMPRECISE;CIPOS=0,4;CIEND=0,4;HOMSEQ=ACCC;HOMLEN=4;NRDS=HWI-ST1108:2:2107:2128:70395#1/2,HWI-ST1108:2:2305:1420:62632#1/1;TRDS=DOGFISH:2:15:17210:4331#1/1,HWI-ST1108:7:1104:7652:31839#1/2,HWI-ST1108:7:1106:5826:9055#1/2,HWI-ST1108:7:1111:8065:86904#1/1,HWI-ST1108:7:1204:3543:33082#1/2,HWI-ST1108:7:1309:8613:87991#1/1,HWI-ST1108:7:1311:18344:45331#1/1,HWI-ST1108:7:1312:17254:50386#1/2,HWI-ST1108:7:2306:18798:3080#1/1,HWI-ST1108:7:2307:4234:42307#1/2,HWI-ST1108:7:2314:5474:93555#1/1;TSRDS=HWI-ST1108:7:1106:5826:9055#1,HWI-ST1108:7:1204:3543:33082#1,HWI-ST1108:7:1215:13813:33611#1,HWI-ST1108:7:1311:18344:45331#1,HWI-ST1108:7:2113:13635:65872#1,HWI-ST1108:7:2207:10465:29193#1,STONE:2:63:8002:2808#1;BAS=100;BKDIST=5722;SVCLASS=deletion RC:PS 0:0 11:7

In the meantime, here are the first few lines of the bedpe file:

chr1 5796392 5796393 chr9 6585844 6585845 18 4 - + PASYUKTumor translocation -1 HWI-ST1108:3:1107:17141:87527#1,HWI-ST1108:7:1312:8009:16532#1,HWI-ST1108:7:1315:10545:75355#1,HWI-ST1108:7:2206:16268:60423#1 4 0 0 1 1 0

chr1 14352895 14352900 chr1 14358622 14358627 34 7 + + PASYUK_Tumor,PASYUK_Normal deletion 5722 100 HWI-ST1108:7:1106:5826:9055#1,HWI-ST1108:7:1204:3543:33082#1,HWI-ST1108:7:1215:13813:33611#1,HWI-ST1108:7:1311:18344:45331#1,HWI-ST1108:7:2113:13635:65872#1,HWI-ST1108:7:2207:10465:29193#1,STONE:2:63:8002:2808#1 7 0 0 1 1 1 0 Chr.chr1 14352896(900)--ACCC--14358623(27) Chr.chr1 (score 100) . ACCC DOGFISH:2:15:17210:4331#1/1,HWI-ST1108:7:1104:7652:31839#1/2,HWI-ST1108:7:1106:5826:9055#1/2,HWI-ST1108:7:1111:8065:86904#1/1,HWI-ST1108:7:1204:3543:33082#1/2,HWI-ST1108:7:1309:8613:87991#1/1,HWI-ST1108:7:1311:18344:45331#1/1,HWI-ST1108:7:1312:17254:50386#1/2,HWI-ST1108:7:2306:18798:3080#1/1,HWI-ST1108:7:2307:4234:42307#1/2,HWI-ST1108:7:2314:5474:93555#1/1|HWI-ST1108:2:2107:2128:70395#1/2,HWI-ST1108:2:2305:1420:62632#1/1 12 _ 0

As you can see, all the gene information is “_”. Again, thank you very much for helping us with this annotation issue. Thanks, Yonghong

From: AndyMenzies notifications@github.com Reply-To: cancerit/VAGrENT reply@reply.github.com Date: Thursday, September 20, 2018 at 11:54 AM To: cancerit/VAGrENT VAGrENT@noreply.github.com Cc: "Wang, Yonghong (NIH/NCI) [E]" wangyong@mail.nih.gov, Author author@noreply.github.com Subject: Re: [cancerit/VAGrENT] Building cache files (#32)


AndyMenzies commented 6 years ago

Hi Yonghong

Your data and your reference look consistent. One last little check, could you show me the first few lines of the Vagrent cache file you have been using.

Could you also have a look in your BRASS bedpe file and tell me if the gene1 or gene2 columns have anything other than '_' in them.
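The consistency check referred to above (contig names in the fasta index versus the CHROM values in the VCF) can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, the inputs are the excerpts posted earlier in the thread, and the mate contig inside a BND ALT field (for example chr9) is not inspected.

```python
def fai_contigs(fai_text):
    """Return the contig names (first column) from the text of a .fai index."""
    return {line.split()[0] for line in fai_text.splitlines() if line.strip()}


def vcf_contigs(vcf_text):
    """Return CHROM values from VCF data lines, skipping '#' header lines."""
    return {
        line.split()[0]
        for line in vcf_text.splitlines()
        if line.strip() and not line.startswith("#")
    }


# Excerpts taken from this thread; real files are tab-separated,
# which str.split() handles as well.
fai = """chr1 249250621 6 60 61
chr2 243199373 253404811 60 61"""
vcf = """#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 5796393 18_1 G [chr9:6585845[G 4 . SVTYPE=BND"""

# Contigs used by the VCF that the reference index does not know about.
missing = vcf_contigs(vcf) - fai_contigs(fai)
print(sorted(missing))
```

An empty result means every contig named in the VCF is present in the reference index, which is the situation described as "consistent" here.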

Andy

gbnci commented 6 years ago

Good morning, Andy. Because all our BAM files have headers with chromosome notation such as “chr1”, “chr2”, etc., I had to change the chromosome notation in all the files used by BRASS to be compatible with the BAM files. In the cache file I also changed the chromosome notation, so that it now looks like:

chr1 29553 31097 ENST00000473358 MIR1302-10 712 $VAR1 = bless( {'_proteinacc' => undef,'_cdsmaxpos' => undef,'_genename' => 'MIR1302-10','_proteinaccversion' => undef,'_cdsminpos' => undef,'_cdsphase' => -1,'_cdnaseq' => undef,'_genetype' => 'lincRNA','_acc' => 'ENST00000473358','_accversion' => '1','_genomicminpos' => '29554','_ccds' => undef,'_exons' => [bless( {'_minpos' => '29554','_chr'=>'chr1','_genomeVersion' => 'GRCh37','_rnaminpos' => 1,'_rnamaxpos' => 486,'_maxpos' => '30039','_species' => 'human'}, 'Sanger::CGP::Vagrent::Data::Exon' ),bless( {'_species' => 'human','_maxpos' => '30667','_chr'=>'chr1','_genomeVersion' => 'GRCh37','_minpos' => '30564','_rnamaxpos' => 590,'_rnaminpos' => 487}, 'Sanger::CGP::Vagrent::Data::Exon' ),bless( {'_species' => 'human','_maxpos' => '31097','_rnamaxpos' => 712,'_rnaminpos' => 591,'_chr'=>'chr1','_genomeVersion' => 'GRCh37','_minpos' => '30976'}, 'Sanger::CGP::Vagrent::Data::Exon' )],'_strand' => 1,'_genomicmaxpos' => '31097','_db' => 'Ensembl','_dbversion' => 'homo_sapiens_91_37'}, 'Sanger::CGP::Vagrent::Data::Transcript' );

chr1 30266 31109 ENST00000469289 MIR1302-10 535 $VAR1 = bless( {'_genomicmaxpos' => '31109','_strand' => 1,'_dbversion' => 'homo_sapiens_91_37','_db' => 'Ensembl','_genomicminpos' => '30267','_accversion' => '1','_acc' => 'ENST00000469289','_genetype' => 'lincRNA','_exons' => [bless( {'_maxpos' => '30667','_species' => 'human','_minpos' => '30267','_chr'=>'chr1','_genomeVersion' => 'GRCh37','_rnaminpos' => 1,'_rnamaxpos' => 401}, 'Sanger::CGP::Vagrent::Data::Exon' ),bless( {'_rnaminpos' => 402,'_rnamaxpos' => 535,'_minpos' => '30976','_chr'=>'chr1','_genomeVersion' => 'GRCh37','_maxpos' => '31109','_species' => 'human'}, 'Sanger::CGP::Vagrent::Data::Exon' )],'_ccds' => undef,'_cdsphase' => -1,'_cdnaseq' => undef,'_cdsminpos' => undef,'_cdsmaxpos' => undef,'_proteinacc' => undef,'_proteinaccversion' => undef,'_genename' => 'MIR1302-10'}, 'Sanger::CGP::Vagrent::Data::Transcript' );

In the bedpe file, every column from gene1 to the last one (“first/last2”) is missing a value, except “fusion_flag”, which is always 0. Another piece of information I should have mentioned is that, in the logs directory, the last error file generated is:

Sanger_CGP_Brass_Implement_bedGraphToBigWig.0.err

bash -c 'set -o pipefail; (cat /data/CCRBioinfo/wangyh/dis_BRASS/PATAWV_test/tmpBrass/assemble/bedpe.* | sort -k1,1 -k 2,2n > /data/CCRBioinfo/wangyh/dis_BRASS/PATAWV_test/PATAWV_Tumor_vs_PATAWV_Normal.assembled.bedpe)'

/usr/bin/perl /opt/wtsi-cgp/bin/grass.pl -genome_cache /data/CCRBioinfo/wangyh/brass_file/chr_cache/chr_vagrent.human.GRCh37.homo_sapiens_91_37.cache.gz -ref /data/CCRBioinfo/wangyh/chr_genome.fa -species human -assembly GRCh37 -platform ILLUMINA -protocol WGS -tumour PATAWV_Tumor -normal PATAWV_Normal -file /data/CCRBioinfo/wangyh/dis_BRASS/PATAWV_test/PATAWV_Tumor_vs_PATAWV_Normal.assembled.bedpe -add_header brassVersion=6.1.2

Very sorry for all the trouble; your help is much appreciated. Best regards, Yonghong


AndyMenzies commented 6 years ago

Hi Yonghong

The fusion_flag comes from Grass, so it is running the annotation step. And it looks like your cache file should be compatible with your data.

Most structural variants are intergenic, with only a minority overlapping with genes. If you only have a small sample set you may have been unlucky and not found something that overlaps a gene.

How many samples have you run through Brass and roughly how many variants are you seeing per sample?

You can test this yourself. The cache file format is based on BED format, so you can run a simple tabix search to identify any overlapping transcripts, i.e. if you run

tabix /data/CCRBioinfo/wangyh/brass_file/chr_cache/chr_vagrent.human.GRCh37.homo_sapiens_91_37.cache.gz chr1:29554-29554

It should pull out ENST00000473358 from MIR1302-10
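If tabix is not to hand, the same overlap test can be illustrated in plain Python. This is a sketch under two assumptions: the cache follows the usual BED convention (0-based start, exclusive end), and only the first five columns of each record matter for the lookup. The `overlapping` function is hypothetical, and the two records are copied from the cache excerpt posted above.

```python
def overlapping(records, chrom, pos):
    """Return (accession, gene) pairs whose interval contains 1-based pos."""
    hits = []
    for rec_chrom, start, end, acc, gene in records:
        # BED start is 0-based and end is exclusive, so a 1-based position
        # lies inside the interval when start < pos <= end.
        if rec_chrom == chrom and start < pos <= end:
            hits.append((acc, gene))
    return hits


# First five columns of the two cache records shown earlier in the thread.
records = [
    ("chr1", 29553, 31097, "ENST00000473358", "MIR1302-10"),
    ("chr1", 30266, 31109, "ENST00000469289", "MIR1302-10"),
]

print(overlapping(records, "chr1", 29554))
# → [('ENST00000473358', 'MIR1302-10')]
```

The second transcript is not returned because its first base is at 1-based position 30267, downstream of the queried position.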

You can manually double-check which transcripts Grass would see by searching with the breakpoint coordinates of the SVs you have identified. You have to search each break independently, so for the example variants you showed me earlier you could search with

chr1:5796393-5796393 and chr9:6585845-6585845
chr1:14352896-14352900 and chr1:14358623-14358627

For me, only chr9:6585845-6585845 returns an overlapping transcript. Unfortunately it's part of an unassembled SV, and we can't annotate those automatically as some of the data we need is missing.
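The search regions quoted above can be derived mechanically from the first six bedpe columns. The sketch below assumes the bedpe start coordinates are 0-based and the end coordinates 1-based, which is consistent with the two example records posted in this thread. `breakpoint_regions` is a hypothetical helper, not part of BRASS or Grass.

```python
def breakpoint_regions(bedpe_line):
    """Build one tabix-style region string per breakpoint of a bedpe record."""
    chr1, start1, end1, chr2, start2, end2 = bedpe_line.split()[:6]
    # Convert each 0-based start to a 1-based start; ends are used unchanged.
    low = f"{chr1}:{int(start1) + 1}-{int(end1)}"
    high = f"{chr2}:{int(start2) + 1}-{int(end2)}"
    return low, high


# Leading columns of the two bedpe records posted earlier in the thread.
translocation = "chr1 5796392 5796393 chr9 6585844 6585845 18 4 - +"
deletion = "chr1 14352895 14352900 chr1 14358622 14358627 34 7 + +"

for record in (translocation, deletion):
    print(" and ".join(breakpoint_regions(record)))
```

Each resulting string can be passed as the region argument of a tabix query against the cache file, one breakpoint at a time.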

Andy

gbnci commented 6 years ago

Hi, Andy: Thanks for your troubleshooting. We have 89 samples that have been pushed through BRASS and all look the same (I mean missing gene annotation information). I will double check this next week and will let you know if there is any problem. Thanks and have a nice weekend Yonghong


keiranmraine commented 4 years ago

Closing as stale

cancerit / VAGrENT

Building cache files #32
