KeyError: 'ID' - Githubissues

NCBI-Hackathons / rnaseqview

RNA-seq Viewer Team at the NCBI-assisted Boston Genomics Hackathon

Creative Commons Zero v1.0 Universal

37 stars 17 forks

KeyError: 'ID' #2

Open pbpayal opened 6 years ago

pbpayal commented 6 years ago

Can I use this tool on local bam files for my project? I haven't submitted my data to SRA yet!

python counter.py --inp test/Test1.bam --out Test1_counts --gtf test/ref-transcripts.gtf

Error:

    Using this file annotation test/ref-transcripts.gtf
    samtools sort -n -O sam test/Test1.bam -o /dev/stdout | awk '$7=="="' | htseq-count -s no -i gene - test/ref-transcripts.gtf > Test1_counts.tsv
    Traceback (most recent call last):
      File "counter.py", line 150, in <module>
        out_fn = normalize(out_fn, gtf)
      File "counter.py", line 68, in normalize
        size = _get_size(gtf)
      File "counter.py", line 32, in _get_size
        transcript_id = feature.attr['ID']
    KeyError: 'ID'

I know it's mentioned that "GTF needs to have ID and gene in the attributes field.", but what do you mean by that? I tried replacing the code with:

    if feature.type == "exon":
        transcript_id = feature.attr['gene_id']
        gene[transcript_id] = feature.attr['gene_name']

The program runs, but only gives empty output files!!
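The traceback shows `counter.py` reading `feature.attr['ID']`, and the pipeline command passes `-i gene` to htseq-count, which names the attribute used as the feature ID. So the quoted requirement appears to mean that each exon line of the GTF carries attributes literally named `ID` and `gene` in its ninth column. A minimal sketch for listing which attribute keys a GTF actually has is below. It assumes the usual `key "value";` attribute syntax; the helper names and the example record are made up for illustration and are not part of this repo.

```python
import re
from collections import Counter

# Matches GTF-style attribute pairs such as: gene_id "G1";
_ATTR_RE = re.compile(r'\s*(\S+)\s+"([^"]*)"\s*;?')


def parse_attributes(field):
    """Return a dict of the key/value pairs in a GTF attribute column."""
    return {key: value for key, value in _ATTR_RE.findall(field)}


def count_attribute_keys(lines, feature_type="exon"):
    """Count how often each attribute key occurs on features of one type."""
    counts = Counter()
    for line in lines:
        # Skip comments and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        # A GTF record has nine tab-separated columns; column 3 is the type.
        if len(cols) < 9 or cols[2] != feature_type:
            continue
        counts.update(parse_attributes(cols[8]).keys())
    return counts


# Hypothetical record, not taken from test/ref-transcripts.gtf.
example = [
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\t'
    'gene_id "G1"; transcript_id "T1"; gene_name "ABC";\n',
]
keys = count_attribute_keys(example)
print(sorted(keys))
print("has ID:", "ID" in keys, "| has gene:", "gene" in keys)
```

To run it on a real file, pass an open file handle instead of `example`, for instance `count_attribute_keys(open("test/ref-transcripts.gtf"))`. If neither `ID` nor `gene` shows up, both the `feature.attr['ID']` lookup and the `-i gene` option would have nothing to match.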

eweitz commented 6 years ago

@lpantano, any ideas here?

lpantano commented 6 years ago

Hi,

It should work with BAM files. I think the problem is in parsing the GTF.

Do your BAM files use the same chromosome naming as the GTF? I ask because getting only empty files is strange. At the very least, the *tsv file should contain something.

After that we can debug why the transcript part is not working.

Can you check that?
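One way to run that check, sketched under a few assumptions: the BAM header is dumped to text first (for example with `samtools view -H Test1.bam`), and the reference names on its `@SQ` lines are compared with the first column of the GTF. The function names and the sample lines below are illustrative only.

```python
def sam_header_references(header_lines):
    """Collect reference names from the SN: tags of @SQ header lines."""
    names = set()
    for line in header_lines:
        if not line.startswith("@SQ"):
            continue
        for field in line.rstrip("\n").split("\t")[1:]:
            if field.startswith("SN:"):
                names.add(field[3:])
    return names


def gtf_references(gtf_lines):
    """Collect sequence names from the first column of a GTF."""
    names = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0])
    return names


# Made-up inputs showing a naming mismatch ("chr1" versus "1").
header = [
    "@HD\tVN:1.0\tSO:coordinate\n",
    "@SQ\tSN:chr1\tLN:1000\n",
    "@SQ\tSN:chr2\tLN:2000\n",
]
gtf = ['1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "G1";\n']

bam_names = sam_header_references(header)
gtf_names = gtf_references(gtf)
shared = bam_names & gtf_names
print("shared:", shared)
print("only in BAM:", sorted(bam_names - gtf_names))
print("only in GTF:", sorted(gtf_names - bam_names))
```

An empty `shared` set would mean no read can overlap any annotated feature, which would be consistent with counts of zero everywhere.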

We are mainly running this code:

samtools sort -n -O sam {fn_in} -o /dev/stdout | awk '$7==\"=\"' | htseq-count -s no -i gene - {gtf} > {out}

That should produce output with values.
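For readers unfamiliar with the middle step of that pipeline: column 7 of a SAM record is RNEXT, the reference name of the mate, and `=` there means the mate maps to the same reference as the read. The `awk '$7=="="'` filter therefore keeps only such records, and the lone `-` given to htseq-count tells it to read those records from standard input. A rough Python equivalent of the filter, with made-up records, is sketched below; it is not code from `counter.py`.

```python
def keep_same_reference_pairs(sam_lines):
    """Yield SAM lines whose 7th whitespace-separated field is '='.

    This mirrors awk's default field splitting, so header lines
    (which normally have no '=' in field 7) are dropped as well.
    """
    for line in sam_lines:
        fields = line.split()
        if len(fields) >= 7 and fields[6] == "=":
            yield line


# Hypothetical records: a header line, a properly paired read,
# a read whose mate is on another reference, and an unpaired read.
records = [
    "@SQ\tSN:chr1\tLN:1000\n",
    "r1\t99\tchr1\t100\t60\t50M\t=\t300\t250\tACGT\tFFFF\n",
    "r2\t65\tchr1\t100\t60\t50M\tchr2\t300\t0\tACGT\tFFFF\n",
    "r3\t0\tchr1\t100\t60\t50M\t*\t0\t0\tACGT\tFFFF\n",
]
kept = list(keep_same_reference_pairs(records))
print(len(kept), "of", len(records), "lines kept")
```

Note that unpaired reads carry `*` in RNEXT, so a single-end BAM would pass nothing through this filter. That is one possible cause of an empty result, though nothing in this thread confirms it for the test files.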

Cheers

pbpayal commented 6 years ago

But I only used the test files that I downloaded from this repo, and I checked that both the BAM and GTF files use chr annotation!

Could it be because "htseq-count -s no -i gene - {gtf} > {out}" doesn't have the SAM/BAM file as input in the command?

lpantano commented 6 years ago

yeah, it would be something like this:

samtools sort -n -O sam Test1.bam -o /dev/stdout | awk '$7=="="' | htseq-count -s no -i gene - ref-transcripts.gtf > sample.tsv

I will try to replicate the issue tomorrow. Sorry about this.

lpantano commented 6 years ago

OK, I see the issue.

@eweitz, can you remind me what input file we need for the ideogram? Does it have to be the Entrez symbol ID and the expression?

eweitz commented 6 years ago

@lpantano, the step in our pipeline after counter.py is formatter.py, which takes an input file like SRR562645_counts_norm.tsv produced by counter.py. I believe that TSV file contains gene symbol (e.g. BRCA1) and expression.

The formatter.py script then outputs a JSON file containing custom-formatted annotations, which is the input for Ideogram.js.