Open smshuai opened 5 years ago
Hi Shimin,
Thanks for your message! Sorry to delay in response! This is a new one, not sure the cause. But I'd like to see the lines 70-80 of circRNA.5reads.10ind.annotated. If you'd rather not put it on github, my email is nicholas.kipp.akers@gmail.com .
Thanks! Kipp
Hi Kipp,
I see there are some malformed rows in the file you mentioned:
chr9:88920107-88924932 gene_id:"ENSG00000083223.13";transcript_id:"ENST00000277141.6";gene_type:"protein_coding";gene_status:"KNOWN";gene_name:"ZCCHC6";transcript_type:"protein_coding";transcript_status:"KNOWN";transcript_name:"ZCCHC6-003";exon_number:24;exon_id:"ENSE00001171520.1";level:2;protein_id:"ENSP00000277141.6";tag:"basic";havana_gene:"OTTHUMG00000020137.1";havana_transcript:"OTTHUMT00000052920.1"; - 0 gene_id:"ENSG00000083223.13";transcript_id:"ENST00000277141.6";gene_type:"protein_coding";gene_status:"KNOWN";gene_name:"ZCCHC6";transcript_type:"protein_coding";transcript_status:"KNOWN";transcript_name:"ZCCHC6-003";exon_number:20;exon_id:"ENSE00000710328.1";level:2;protein_id:"ENSP00000277141.6";tag:"basic";havana_gene:"OTTHUMG00000020137.1";havana_transcript:"OTTHUMT00000052920.1"; - 0
GL000220.1:112552-113003 gene_name:NoGene NA NA gene_name:NoGeneNA NA
GL000220.1:112552-156975 gene_name:NoGene NA NA gene_name:NoGeneNA NA
Lentiviral:512-5258
Lentiviral:512-6530
chr11:46098305-46113774 gene_id:"ENSG00000135365.11";transcript_id:"ENST00000257821.4";gene_type:"protein_coding";gene_status:"KNOWN";gene_name:"PHF21A";transcript_type:"protein_coding";transcript_status:"KNOWN";transcript_name:"PHF21A-201";exon_number:5;exon_id:"ENSE00003588041.1";level:3;protein_id:"ENSP00000257821.4";tag:"basic";tag:"appris_candidate_longest";havana_gene:"OTTHUMG00000167038.2"; - 0 gene_id:"ENSG00000135365.11";transcript_id:"ENST00000257821.4";gene_type:"protein_coding";gene_status:"KNOWN";gene_name:"PHF21A";transcript_type:"protein_coding";transcript_status:"KNOWN";transcript_name:"PHF21A-201";exon_number:2;exon_id:"ENSE00001421291.1";level:3;protein_id:"ENSP00000257821.4";tag:"basic";tag:"appris_candidate_longest";havana_gene:"OTTHUMG00000167038.2"; - 0
I added a customized chromosome (a Lentiviral vector) to the reference hg19 FASTA for STAR. Could this be the cause? If so, is there any easy workaround?
Best, Shimin
Hello, I have a problem very similar to the one of @smshuai. I got an high number of warning messages about uninitialized values and I saw that the corresponding lines in the circRNA.5reads.10ind.annotated file are malformed. Specifically, those lines contain only the first field, which is the genome coordinate of the circRNA, I guess. Strangely, malformed lines involve only a subset of circRNAs on chromosomes 9 and X. I'm using hg19 genome and annotation file downloaded from ensembl.
Here some lines of warning messages maybe useful for debug purposes.
Creating a count matrix. This may take a long time for very large cohorts. annotating your cRNA Use of uninitialized value in string eq at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 20, <> line 7169. Use of uninitialized value in string ne at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 39, <> line 7169. Use of uninitialized value $p2Distance in numeric ne (!=) at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 41, <> line 7169. Use of uninitialized value $p2Distance in numeric eq (==) at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 46, <> line 7169. Use of uninitialized value $p2Distance in numeric eq (==) at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 52, <> line 7169. Use of uninitialized value in concatenation (.) or string at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 53, <> line 7169. Use of uninitialized value $p2Strand in string eq at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 55, <> line 7169. Use of uninitialized value $p2Strand in concatenation (.) or string at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 58, <> line 7169. Use of uninitialized value $p2Info in split at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 12, <> line 7170. Use of uninitialized value in string eq at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 20, <> line 7170. Use of uninitialized value in string eq at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 20, <> line 7170. Use of uninitialized value $p1Distance in string eq at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 24, <> line 7170. Use of uninitialized value $p1Distance in string eq at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 24, <> line 7170. Use of uninitialized value $p1Distance in numeric eq (==) at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 27, <> line 7170. Use of uninitialized value $p2Distance in numeric eq (==) at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 27, <> line 7170. Use of uninitialized value $GeneID in substitution (s///) at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 74, <> line 7170. Use of uninitialized value $GeneID in substitution (s///) at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 75, <> line 7170. Use of uninitialized value $GeneID in substitution (s///) at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 76, <> line 7170. Use of uninitialized value $GeneID in substitution (s///) at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 77, <> line 7170. Use of uninitialized value $strand in string eq at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 98, <> line 7170. Use of uninitialized value $GeneID in concatenation (.) or string at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 104, <> line 7170. Use of uninitialized value $strand in concatenation (.) or string at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 104, <> line 7170.
It seems that those warnings don't compromise starchip algorithm, but I am not 100% sure of that.
Best regards, Nicola
@smshuai
Hi Again Shimin, Yes, that's the cause of the issue alright. It's a little odd though, when starchip doesn't find a gene in your gtf.exons.bed file, it should annotate with "NoGene", like your GL000220.1:112552-113003 line. But your Lentiviral circRNA are showing up. Are there lines in your gtf.exons.bed file for Lentiviral? ie if you run
fgrep -w Lentiviral gencode.v19.annotation.gtf.exons.bed
does it have unusual lines?
The quick fix is to add the same "NoGene" annotations that GL000220.1:112552-113003 has to your Lentiviral lines in circRNA.5reads.10ind.annotated. Then run the remaining lines from Step2.sh, which should be something like:
tail -n +2 circRNA.5reads.10ind.annotated | /path/to/starchip/scripts/circles/summarize_geneinfo.pl > circRNA.5reads.10ind.genes
for cutoff in SOMETHING ; do
Rscript /path/to/starchip/scripts/circles/cpm2.R circRNA.5reads.10ind.countmatrix ${cpmcutoff} ${subjectcpm} circRNA.5reads.10ind.genes &
done
etc. etc.
I hope that helps! Kipp
Hi Nicola the @cursecatcher ,
Thanks for adding your input. Maybe I've got a bug on my hands here. Can you send me an offending circRNA coordinate, and let me know what your reference fasta and gtf are (Hopefully from somewhere I can download them to test).
You're right, this shouldn't impact STARChips output too much. Obviously no gene annotations for those files, but that's it.
Hi Kipp, thank you for your quick response.
Here my reference and gtf files: ftp://ftp.ensembl.org/pub/grch37/update/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.toplevel.fa.gz ftp://ftp.ensembl.org/pub/grch37/update/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.gtf.gz
And here some coordinates:
9:139243149-139244215
X:153996577-153997585
Please let me know if you need other data. Thank you again. Nicola
Hi Again Nicola,
I can't recreate the issue. What happens for you when you do this? You'll have to fix the paths for the .sh and .pl script (they are in starchip/scripts/circles/), as well as the reference bed file.
echo "cRNA\n9:139243149-139244215\nX:153996577-153997585" > circs.txt
./cRNA_coordinates2genes.sh circs.txt ~/data/ref/Homo_sapiens.GRCh37.87.gtf.exons.bed && tail -n +2 circs.txt.annotated | ./summarize_geneinfo.pl
I get:
ID Gene(s) Exon1 Exon2 Strand SpliceType
9:139243149-139244215 GPSM1 9 10 + Exon-Exon
X:153996577-153997585 DKC1 1 9 + Exon-Exon
Hi Kipp, this is what I get, as well as a high number of perl warnings like the ones of the previous post.
ID Gene(s) Exon1 Exon2 Strand SpliceType
9:139243149-139244215 NA NA Exon-Exon
X:153996577-153997585 NA NA Exon-Exon
Maybe I found the problem. I notice that at the top there is this error message.
Error: requested chromosome GL000191.1 does not exist in the genome file /genome/genome.gtf.exons.genome. Exiting.
I made some researches: this is the error message of bedtools closest, called in cRNA_coordinates2genes.sh. Then, I returned to the log file created during genome downloading and indexing. There are a lot of warning messages from rsem-extract-reference-transcripts such as
Warning: Cannot extract transcript ENST00000374457's sequence since the chromosome it locates, GL000191.1, is absent!
and the final message:
Warning: 184 transcripts are failed to extract because their chromosome sequences are absent.
In fact my fasta file contains only canonical chromosomes, while in the gtf there are also unplaced contigs. On groups.google I found that I have either to filter out the gtf, or add the missing fasta records.
Could this be the cause? Now I'll try to restart from the beginning, but I am not sure if this will resolve my problem. Despite the errors I already obtained a large number of circRNAs, except for this little fraction in chromosomes 9 and X.
Thank you for your time, Kipp. Kind regards, Nicola
Hi Nicola,
Yes, you've solved it! The STAR developer strongly suggests performing alignments using the "primary" fasta only (see PRI on gencode for example). That is, chromosomes and unplaced contigs, but not alternative loci. You definitely want to match your gtf to this as well.
I would restart from the beginning, using the primary assembly fasta + gtf to build STAR then run STARChip.
Good luck! Kipp
Hello,
I just tried out starchip. I have no problem with step 1
But step 2 gave me errors:
I am using GENCODE v19 GTF and the first two lines of the gencode.v19.annotation.gtf.exons.bed are:
Let me know if you need more information to debug.
Thanks, Shimin