broadinstitute / gtex-pipeline

GTEx & TOPMed data production and analysis pipelines
BSD 3-Clause "New" or "Revised" License

Missing files and script errors #81

Closed geng-lee closed 1 year ago

geng-lee commented 1 year ago

Hi,

I have some problems with this pipeline and hope you can help me.

In TOPMed_RNAseq_pipeline.md, under "Reference annotation", step 4 ("The ERCC annotation was appended to the reference GTFs") uses gencode.v34.GRCh38.genes.gtf.

Where is this file downloaded from, or how is it made?

Is it the same as the file in the link below? https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_41/gencode.v41.basic.annotation.gtf.gz

In step 3 ("Gene- and transcript-level attributes were added to the ERCC GTF with the following Python code"), the Python script fails with "IndentationError: expected an indented block". Is it correct to change it to the following?

with open('ERCC92.gtf') as exon_gtf, open('ERCC92.genes.patched.gtf', 'w') as gene_gtf:
    for line in exon_gtf:
        f = line.strip().split('\t')
        f[0] = f[0].replace('-','_')  # required for RNA-SeQC/GATK (no '-' in contig name)

        attr = f[8]
        if attr[-1]==';':
            attr = attr[:-1]
        attr = dict([i.split(' ') for i in attr.replace('"','').split('; ')])
        # add gene_name, gene_type
        attr['gene_name'] = attr['gene_id']
        attr['gene_type'] = 'ercc_control'
        attr['gene_status'] = 'KNOWN'
        attr['level'] = 2
        for k in ['id', 'type', 'name', 'status']:
            attr['transcript_'+k] = attr['gene_'+k]

        attr_str = []
        for k in ['gene_id', 'transcript_id', 'gene_type', 'gene_status', 'gene_name',
            'transcript_type', 'transcript_status', 'transcript_name']:
            attr_str.append('{0:s} "{1:s}";'.format(k, attr[k]))
        attr_str.append('{0:s} {1:d};'.format('level', attr['level']))
        f[8] = ' '.join(attr_str)

        # write gene, transcript, exon
        gene_gtf.write('\t'.join(f[:2]+['gene']+f[3:])+'\n')
        gene_gtf.write('\t'.join(f[:2]+['transcript']+f[3:])+'\n')
        f[8] = ' '.join(attr_str[:2])
        gene_gtf.write('\t'.join(f[:2]+['exon']+f[3:])+'\n')
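
One way to check the re-indented script without the full ERCC92.gtf is to apply the same per-line logic to a single record. The sketch below wraps that logic in a hypothetical helper, patch_ercc_line (not part of the pipeline), and the sample line is an assumed example of the ERCC92.gtf layout, not copied from the real file.

```python
def patch_ercc_line(line):
    """Return the gene, transcript and exon GTF lines for one ERCC exon record.

    Hypothetical helper: it mirrors the per-line logic of the script above.
    """
    f = line.strip().split('\t')
    f[0] = f[0].replace('-', '_')  # no '-' in the contig name

    # parse 'key "value"; key "value";' into a dict
    attr = f[8].rstrip(';')
    attr = dict(i.split(' ') for i in attr.replace('"', '').split('; '))
    attr['gene_name'] = attr['gene_id']
    attr['gene_type'] = 'ercc_control'
    attr['gene_status'] = 'KNOWN'
    attr['level'] = 2
    # transcript_* attributes are copied from gene_* (transcript_id is overwritten)
    for k in ['id', 'type', 'name', 'status']:
        attr['transcript_' + k] = attr['gene_' + k]

    keys = ['gene_id', 'transcript_id', 'gene_type', 'gene_status', 'gene_name',
            'transcript_type', 'transcript_status', 'transcript_name']
    attr_str = ['{0:s} "{1:s}";'.format(k, attr[k]) for k in keys]
    attr_str.append('level {0:d};'.format(attr['level']))

    full = ' '.join(attr_str)       # gene and transcript lines carry all attributes
    short = ' '.join(attr_str[:2])  # exon line keeps only gene_id and transcript_id
    return ['\t'.join(f[:2] + [feature] + f[3:8] + [a])
            for feature, a in [('gene', full), ('transcript', full), ('exon', short)]]


# Assumed sample record in the ERCC92.gtf layout (illustrative only).
sample = ('ERCC-00002\tERCC\texon\t1\t1061\t0.000000\t+\t.\t'
          'gene_id "ERCC-00002"; transcript_id "DQ459430";')
for out_line in patch_ercc_line(sample):
    print(out_line)
```

If the three printed lines have the features gene, transcript and exon, and the contig reads ERCC_00002, the indentation fix behaves the same as the loop body above.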

best, jamie

francois-a commented 1 year ago

Please see this section for information on where the reference annotation was downloaded from, and how it was processed. Thanks for pointing out the indentation error in the readme. The patched ERCC files are now available here.
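
For the appending step asked about above, a minimal sketch follows, assuming that "appended" means plain concatenation of the patched ERCC GTF onto the reference GTF. The helper name append_gtf and the output file name in the comment are hypothetical; the linked section remains the authoritative description of how the reference was built.

```python
def append_gtf(reference_path, ercc_path, out_path):
    """Write the contents of reference_path followed by ercc_path to out_path.

    Hypothetical helper: assumes the ERCC records only need to be concatenated
    after the reference annotation, with no re-sorting.
    """
    with open(out_path, 'w') as out:
        for path in (reference_path, ercc_path):
            last = '\n'
            with open(path) as src:
                for line in src:
                    out.write(line)
                    last = line
            # make sure the next file starts on a fresh line
            if not last.endswith('\n'):
                out.write('\n')


# Example call (the output name is an assumption, not taken from the pipeline):
# append_gtf('gencode.v34.GRCh38.genes.gtf',
#            'ERCC92.genes.patched.gtf',
#            'gencode.v34.GRCh38.ERCC.genes.gtf')
```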