LosicLab / starchip

Detection of Circular RNA and Fusions from RNA-Seq
http://starchimp.readthedocs.io/en/latest/
MIT License
32 stars 11 forks source link

Step 2 error during annotation step #23

Open smshuai opened 5 years ago

smshuai commented 5 years ago

Hello,

I just tried out starchip. I have no problem with step 1

$ ./Step1.sh

Filtering out chimeric reads that appear circular
Filtering circular reads based on assigned thresholds of 5 reads in 10 individuals
creating circRNA fasta and gtf for re-alignment and quantification

But step 2 gave me errors:

$ ./Step2.sh

Creating a count matrix. This may take a long time for very large cohorts.
annotating your cRNA
Use of uninitialized value $p2Info in split at starchip/scripts/circles/summarize_geneinfo.pl line 12, <> line 77.
Use of uninitialized value in string eq at starchip/scripts/circles/summarize_geneinfo.pl line 20, <> line 77.
Use of uninitialized value in string eq at starchip/scripts/circles/summarize_geneinfo.pl line 20, <> line 77.
[truncating error msgs]
Use of uninitialized value $GeneID in concatenation (.) or string at starchip/scripts/circles/summarize_geneinfo.pl line 104, <> line 78.
Use of uninitialized value $strand in concatenation (.) or string at starchip/scripts/circles/summarize_geneinfo.pl line 104, <> line 78.
generating counts per million, transforming your data with voom, and generating plots
Now splicing your cRNA

Attaching package: 'gplots'

The following object is masked from 'package:stats':

    lowess

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,  :
  line 77 did not have 6 elements
Calls: read.table -> scan
Execution halted

I am using GENCODE v19 GTF and the first two lines of the gencode.v19.annotation.gtf.exons.bed are:

chr1    11869   12227   gene_id:"ENSG00000223972.4";transcript_id:"ENST00000456328.2";gene_type:"pseudogene";gene_status:"KNOWN";gene_name:"DDX11L1";transcript_type:"processed_transcript";transcript_status:"KNOWN";transcript_name:"DDX11L1-002";exon_number:1;exon_id:"ENSE00002234944.1";level:2;tag:"basic";havana_gene:"OTTHUMG00000000961.2";havana_transcript:"OTTHUMT00000362751.1";  .   +   HAVANAexon
chr1    11872   12227   gene_id:"ENSG00000223972.4";transcript_id:"ENST00000515242.2";gene_type:"pseudogene";gene_status:"KNOWN";gene_name:"DDX11L1";transcript_type:"transcribed_unprocessed_pseudogene";transcript_status:"KNOWN";transcript_name:"DDX11L1-201";exon_number:1;exon_id:"ENSE00002234632.1";level:3;havana_gene:"OTTHUMG00000000961.2"; .   +   ENSEMBL exon

Let me know if you need more information to debug.

Thanks, Shimin

kippakers commented 5 years ago

Hi Shimin,

Thanks for your message! Sorry to delay in response! This is a new one, not sure the cause. But I'd like to see the lines 70-80 of circRNA.5reads.10ind.annotated. If you'd rather not put it on github, my email is nicholas.kipp.akers@gmail.com .

Thanks! Kipp

smshuai commented 5 years ago

Hi Kipp,

I see there are some malformed rows in the file you mentioned:

chr9:88920107-88924932  gene_id:"ENSG00000083223.13";transcript_id:"ENST00000277141.6";gene_type:"protein_coding";gene_status:"KNOWN";gene_name:"ZCCHC6";transcript_type:"protein_coding";transcript_status:"KNOWN";transcript_name:"ZCCHC6-003";exon_number:24;exon_id:"ENSE00001171520.1";level:2;protein_id:"ENSP00000277141.6";tag:"basic";havana_gene:"OTTHUMG00000020137.1";havana_transcript:"OTTHUMT00000052920.1"; -   0   gene_id:"ENSG00000083223.13";transcript_id:"ENST00000277141.6";gene_type:"protein_coding";gene_status:"KNOWN";gene_name:"ZCCHC6";transcript_type:"protein_coding";transcript_status:"KNOWN";transcript_name:"ZCCHC6-003";exon_number:20;exon_id:"ENSE00000710328.1";level:2;protein_id:"ENSP00000277141.6";tag:"basic";havana_gene:"OTTHUMG00000020137.1";havana_transcript:"OTTHUMT00000052920.1"; -   0
GL000220.1:112552-113003    gene_name:NoGene    NA  NA  gene_name:NoGeneNA  NA
GL000220.1:112552-156975    gene_name:NoGene    NA  NA  gene_name:NoGeneNA  NA
Lentiviral:512-5258
Lentiviral:512-6530
chr11:46098305-46113774 gene_id:"ENSG00000135365.11";transcript_id:"ENST00000257821.4";gene_type:"protein_coding";gene_status:"KNOWN";gene_name:"PHF21A";transcript_type:"protein_coding";transcript_status:"KNOWN";transcript_name:"PHF21A-201";exon_number:5;exon_id:"ENSE00003588041.1";level:3;protein_id:"ENSP00000257821.4";tag:"basic";tag:"appris_candidate_longest";havana_gene:"OTTHUMG00000167038.2";    -   0   gene_id:"ENSG00000135365.11";transcript_id:"ENST00000257821.4";gene_type:"protein_coding";gene_status:"KNOWN";gene_name:"PHF21A";transcript_type:"protein_coding";transcript_status:"KNOWN";transcript_name:"PHF21A-201";exon_number:2;exon_id:"ENSE00001421291.1";level:3;protein_id:"ENSP00000257821.4";tag:"basic";tag:"appris_candidate_longest";havana_gene:"OTTHUMG00000167038.2";    -   0

I added a customized chromosome (a Lentiviral vector) to the reference hg19 FASTA for STAR. Could this be the cause? If so, is there any easy workaround?

Best, Shimin

cursecatcher commented 5 years ago

Hello, I have a problem very similar to the one of @smshuai. I got an high number of warning messages about uninitialized values and I saw that the corresponding lines in the circRNA.5reads.10ind.annotated file are malformed. Specifically, those lines contain only the first field, which is the genome coordinate of the circRNA, I guess. Strangely, malformed lines involve only a subset of circRNAs on chromosomes 9 and X. I'm using hg19 genome and annotation file downloaded from ensembl.

Here some lines of warning messages maybe useful for debug purposes.

Creating a count matrix. This may take a long time for very large cohorts. annotating your cRNA Use of uninitialized value in string eq at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 20, <> line 7169. Use of uninitialized value in string ne at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 39, <> line 7169. Use of uninitialized value $p2Distance in numeric ne (!=) at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 41, <> line 7169. Use of uninitialized value $p2Distance in numeric eq (==) at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 46, <> line 7169. Use of uninitialized value $p2Distance in numeric eq (==) at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 52, <> line 7169. Use of uninitialized value in concatenation (.) or string at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 53, <> line 7169. Use of uninitialized value $p2Strand in string eq at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 55, <> line 7169. Use of uninitialized value $p2Strand in concatenation (.) or string at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 58, <> line 7169. Use of uninitialized value $p2Info in split at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 12, <> line 7170. Use of uninitialized value in string eq at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 20, <> line 7170. Use of uninitialized value in string eq at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 20, <> line 7170. Use of uninitialized value $p1Distance in string eq at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 24, <> line 7170. Use of uninitialized value $p1Distance in string eq at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 24, <> line 7170. Use of uninitialized value $p1Distance in numeric eq (==) at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 27, <> line 7170. Use of uninitialized value $p2Distance in numeric eq (==) at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 27, <> line 7170. Use of uninitialized value $GeneID in substitution (s///) at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 74, <> line 7170. Use of uninitialized value $GeneID in substitution (s///) at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 75, <> line 7170. Use of uninitialized value $GeneID in substitution (s///) at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 76, <> line 7170. Use of uninitialized value $GeneID in substitution (s///) at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 77, <> line 7170. Use of uninitialized value $strand in string eq at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 98, <> line 7170. Use of uninitialized value $GeneID in concatenation (.) or string at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 104, <> line 7170. Use of uninitialized value $strand in concatenation (.) or string at /bin/starchip/scripts/circles/summarize_geneinfo.pl line 104, <> line 7170.

It seems that those warnings don't compromise starchip algorithm, but I am not 100% sure of that.

Best regards, Nicola

kippakers commented 5 years ago

@smshuai

Hi Again Shimin, Yes, that's the cause of the issue alright. It's a little odd though, when starchip doesn't find a gene in your gtf.exons.bed file, it should annotate with "NoGene", like your GL000220.1:112552-113003 line. But your Lentiviral circRNA are showing up. Are there lines in your gtf.exons.bed file for Lentiviral? ie if you run

fgrep -w Lentiviral  gencode.v19.annotation.gtf.exons.bed

does it have unusual lines?

The quick fix is to add the same "NoGene" annotations that GL000220.1:112552-113003 has to your Lentiviral lines in circRNA.5reads.10ind.annotated. Then run the remaining lines from Step2.sh, which should be something like:

tail -n +2 circRNA.5reads.10ind.annotated | /path/to/starchip/scripts/circles/summarize_geneinfo.pl > circRNA.5reads.10ind.genes
for cutoff in SOMETHING ; do
    Rscript /path/to/starchip/scripts/circles/cpm2.R circRNA.5reads.10ind.countmatrix ${cpmcutoff} ${subjectcpm} circRNA.5reads.10ind.genes &
done

etc. etc.

I hope that helps! Kipp

kippakers commented 5 years ago

Hi Nicola the @cursecatcher ,

Thanks for adding your input. Maybe I've got a bug on my hands here. Can you send me an offending circRNA coordinate, and let me know what your reference fasta and gtf are (Hopefully from somewhere I can download them to test).

You're right, this shouldn't impact STARChips output too much. Obviously no gene annotations for those files, but that's it.

cursecatcher commented 5 years ago

Hi Kipp, thank you for your quick response.

Here my reference and gtf files: ftp://ftp.ensembl.org/pub/grch37/update/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.toplevel.fa.gz ftp://ftp.ensembl.org/pub/grch37/update/gtf/homo_sapiens/Homo_sapiens.GRCh37.87.gtf.gz

And here some coordinates:

9:139243149-139244215                   
X:153996577-153997585

Please let me know if you need other data. Thank you again. Nicola

kippakers commented 5 years ago

Hi Again Nicola,

I can't recreate the issue. What happens for you when you do this? You'll have to fix the paths for the .sh and .pl script (they are in starchip/scripts/circles/), as well as the reference bed file.

echo "cRNA\n9:139243149-139244215\nX:153996577-153997585" > circs.txt
./cRNA_coordinates2genes.sh circs.txt ~/data/ref/Homo_sapiens.GRCh37.87.gtf.exons.bed  && tail -n +2 circs.txt.annotated | ./summarize_geneinfo.pl 

I get:

ID  Gene(s) Exon1   Exon2   Strand  SpliceType
9:139243149-139244215   GPSM1   9   10  +   Exon-Exon
X:153996577-153997585   DKC1    1   9   +   Exon-Exon
cursecatcher commented 5 years ago

Hi Kipp, this is what I get, as well as a high number of perl warnings like the ones of the previous post.

ID  Gene(s) Exon1   Exon2   Strand  SpliceType
9:139243149-139244215       NA  NA      Exon-Exon
X:153996577-153997585       NA  NA      Exon-Exon

Maybe I found the problem. I notice that at the top there is this error message.

Error: requested chromosome GL000191.1 does not exist in the genome file /genome/genome.gtf.exons.genome. Exiting.

I made some researches: this is the error message of bedtools closest, called in cRNA_coordinates2genes.sh. Then, I returned to the log file created during genome downloading and indexing. There are a lot of warning messages from rsem-extract-reference-transcripts such as

Warning: Cannot extract transcript ENST00000374457's sequence since the chromosome it locates, GL000191.1, is absent!

and the final message:

Warning: 184 transcripts are failed to extract because their chromosome sequences are absent.

In fact my fasta file contains only canonical chromosomes, while in the gtf there are also unplaced contigs. On groups.google I found that I have either to filter out the gtf, or add the missing fasta records.

Could this be the cause? Now I'll try to restart from the beginning, but I am not sure if this will resolve my problem. Despite the errors I already obtained a large number of circRNAs, except for this little fraction in chromosomes 9 and X.

Thank you for your time, Kipp. Kind regards, Nicola

kippakers commented 5 years ago

Hi Nicola,

Yes, you've solved it! The STAR developer strongly suggests performing alignments using the "primary" fasta only (see PRI on gencode for example). That is, chromosomes and unplaced contigs, but not alternative loci. You definitely want to match your gtf to this as well.

I would restart from the beginning, using the primary assembly fasta + gtf to build STAR then run STARChip.

Good luck! Kipp