cgat-developers / cgat-flow

cgat-flow repository
MIT License

RNASeqQC: transcript FASTA (ensembl98) #112

Closed jscaber closed 3 years ago

jscaber commented 4 years ago

Hi,

just noticed that this pipeline produces an unusual transcript FASTA with the new annotations (build 98)... The transcript FASTA has grown from 150MB to 4GB, some transcripts are 200K long, and salmon gives a warning during library construction.

def buildTranscriptFasta(infile, outfile):
    """build geneset where all exons within a gene
    are merged.
    """
    dbname = outfile[:-len(".fasta")]

    statement = '''zcat %(infile)s
    | cgat gff2fasta
    --is-gtf
    --genome=%(genome_dir)s/%(genome)s
    --log=%(outfile)s.log
    | cgat index_fasta
    %(dbname)s --force-output -
    > %(dbname)s.log
    '''
    P.run(statement)
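As a quick sanity check, the oversized records can be confirmed by scanning the generated FASTA for sequences above a length threshold; a minimal sketch (the `longest_records` helper is hypothetical, not part of cgat-flow):

```python
# Sketch: flag suspiciously long records in a transcript FASTA.
# `longest_records` is an illustrative diagnostic, not pipeline code.

def longest_records(fasta_lines, threshold=100_000):
    """Yield (name, length) for records longer than `threshold` bases."""
    name, length = None, 0
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            # emit the previous record before starting a new one
            if name is not None and length > threshold:
                yield name, length
            name, length = line[1:].split()[0], 0
        else:
            length += len(line)
    # emit the final record
    if name is not None and length > threshold:
        yield name, length
```

Run over the 4GB refcoding.fasta (e.g. via `gzip.open`), this would list the 200K-scale records that trigger the salmon warning.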
jscaber commented 4 years ago

Also happens with ensembl 88, which used to work. Is this another argparse error?

Acribbs commented 4 years ago

Hmm it could be, although there are tests for gff2fasta that should have failed if this was the case. If you revert the changes back to optparse (cgat-core should support both) and re-run this should answer the question. I can have a look into it when I have some time.

jscaber commented 4 years ago

I have tried to work this out. The error seems to be upstream: the first file created by the rnaseqqc pipeline in the geneset.dir directory is geneset.gtf, and that is already different between late 2018 and now.

I think this is what might have happened (not 100% sure yet), and it is definitely nothing to do with the recent argument-parsing refactoring:

  1. What used to happen is that buildReferenceGeneSet called mapping.mergeandfilterGTF and then mapping.resetGTFattributes to filter the geneset_all (43MB) from the geneset pipeline. For some reason (although it shouldn't have), this procedure resulted in a 13MB GTF. I am not clear what was filtered out, but this is much smaller than even the geneset_coding_exons from the genesets pipeline, so it probably can't be right?

  2. This was refactored 10 months ago by Andreas, and the mapping.resetGTFattributes call was removed. After the refactoring, buildReferenceGeneSet produces a file that is 31MB in size.

This has resulted in a completely different GTF being passed downstream, e.g. the refcoding GTF in this pipeline (command gtf2gtf --filter) is 30MB vs 10MB with the old code.

I think the change in the code at this stage might well be correct?

However, the refcoding.fasta that is generated (now 4GB vs 120MB) is the wrong file to supply to salmon/sailfish, as it should contain only the transcriptome, which it clearly doesn't.

Do we need another reference.dir in this pipeline, or can it take what it needs from the genesets pipeline directly?

jscaber commented 4 years ago

As an aside, this problem would not affect the diffexpression pipeline, where the user provides the GTF (though maybe the pipeline should make clear that providing a genome GTF is incompatible with salmon/sailfish).

I personally have used "geneset_name/ensembl.dir/geneset_coding_exons.gtf.gz", which is 20MB to start with and worked well for salmon/sailfish index generation. Just putting this out there in case anyone has used this pipeline in the past with the combination of geneset_all and salmon/sailfish.
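For anyone wiring up this workaround, the statement would look roughly like the pipeline's existing gff2fasta call followed by `salmon index`; a sketch in the pipeline's P.run style (the helper name and all paths here are illustrative, not actual pipeline code):

```python
def build_salmon_index_statement(gtf, genome_dir, genome, outdir):
    """Compose a shell statement that extracts a transcript FASTA from a
    coding-exons GTF and indexes it with salmon. Illustrative sketch only:
    it mirrors the pipeline's gff2fasta usage; paths are placeholders."""
    return f"""zcat {gtf}
    | cgat gff2fasta
    --is-gtf
    --genome={genome_dir}/{genome}
    --log={outdir}/gff2fasta.log
    > {outdir}/transcripts.fa
    && salmon index -t {outdir}/transcripts.fa -i {outdir}/salmon.index"""
```

The string is only composed here, not executed; in the pipeline it would be handed to P.run in the usual way.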