bioinfo-pf-curie / TMB

Tumor Mutational Burden
Other
49 stars 15 forks source link

pyEffGenomeSize.py error #14

Closed wt12318 closed 1 year ago

wt12318 commented 1 year ago

Hi, When I use pyEffGenomeSize.py to calculate the genome size, the follow error appeared:

##the code I run
pyEffGenomeSize.py --bed /public/slst/home/wutao2/sun_data/nf_pipeline/KAPA/KAPA_target_sort.bed --gtf /public/slst/home/wutao2/sun_data/nf_pipeline/KAPA/tmb/gencode.v42.annotation.gtf --bam /public/slst/home/wutao2/sun_data/nf_pipeline/KAPA/cnv_pool/all_bam_markdup/tumor/H208317.md.bam --minCoverage 15 --thread 10 --mosdepth --oprefix /public/slst/home/wutao2/sun_data/nf_pipeline/KAPA/tmb/pyeffg --verbose

##error
[E::hts_open_format] Failed to open file "/public/slst/home/wutao2/sun_data/nf_pipeline/KAPA/tmb/pyeffg.intersect.bed" : No such file or directory
SIGSEGV: Illegal storage access. (Attempt to read from nil?)

gzip: /public/slst/home/wutao2/sun_data/nf_pipeline/KAPA/tmb/pyeffg.thresholds.bed.gz: unexpected end of file
[RUNNING INFO]: Loading data

[RUNNING INFO]: filtering GTF file

[RUNNING INFO]: running mosdepth ...

Traceback (most recent call last):
  File "/public/slst/home/wutao2/miniconda3/envs/pytmb/bin/pyEffGenomeSize.py", line 194, in <module>
    getEffGenomeSizeFromMosdepth(args.oprefix +".thresholds.bed")
  File "/public/slst/home/wutao2/miniconda3/envs/pytmb/bin/pyEffGenomeSize.py", line 89, in getEffGenomeSizeFromMosdepth
    with open(infile, 'rt') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/public/slst/home/wutao2/sun_data/nf_pipeline/KAPA/tmb/pyeffg.thresholds.bed'

From this issue (https://github.com/bioinfo-pf-curie/TMB/issues/5) I add the --filterNonCoding, but it still has error:

[RUNNING INFO]: Loading data

[RUNNING INFO]: filtering GTF file

Traceback (most recent call last):
  File "/public/slst/home/wutao2/miniconda3/envs/pytmb/bin/pyEffGenomeSize.py", line 146, in <module>
    feature_filteredGtf = filteredGtf.filter(filterFeatureGtf, featuretypes).saveas("filtered_gtf.gtf")
  File "/public/slst/home/wutao2/miniconda3/envs/pytmb/lib/python3.9/site-packages/pybedtools/bedtool.py", line 917, in decorated
    result = method(self, *args, **kwargs)
  File "/public/slst/home/wutao2/miniconda3/envs/pytmb/lib/python3.9/site-packages/pybedtools/bedtool.py", line 3341, in saveas
    fn = self._collapse(
  File "/public/slst/home/wutao2/miniconda3/envs/pytmb/lib/python3.9/site-packages/pybedtools/bedtool.py", line 1416, in _collapse
    for i in iterable:
  File "pybedtools/cbedtools.pyx", line 759, in pybedtools.cbedtools.IntervalIterator.__next__
  File "/public/slst/home/wutao2/miniconda3/envs/pytmb/lib/python3.9/site-packages/pybedtools/bedtool.py", line 962, in <genexpr>
    return BedTool((f for f in self if func(f, *args, **kwargs)))
  File "pybedtools/cbedtools.pyx", line 759, in pybedtools.cbedtools.IntervalIterator.__next__
  File "/public/slst/home/wutao2/miniconda3/envs/pytmb/lib/python3.9/site-packages/pybedtools/bedtool.py", line 962, in <genexpr>
    return BedTool((f for f in self if func(f, *args, **kwargs)))
  File "/public/slst/home/wutao2/miniconda3/envs/pytmb/bin/pyEffGenomeSize.py", line 70, in filterGtf
    if interval.attrs['transcript_type'] in featuretype:
KeyError: 'transcript_type'

But my gtf file has transcript_type column:

head gencode.v42.annotation.gtf 
##description: evidence-based annotation of the human genome (GRCh38), version 42 (Ensembl 108)
##provider: GENCODE
##contact: gencode-help@ebi.ac.uk
##format: gtf
##date: 2022-07-20
chr1    HAVANA  gene    11869   14409   .       +       .       gene_id "ENSG00000290825.1"; gene_type "lncRNA"; gene_name "DDX11L2"; level 2; tag "overlaps_pseudogene";
chr1    HAVANA  transcript      11869   14409   .       +       .       gene_id "ENSG00000290825.1"; transcript_id "ENST00000456328.2"; gene_type "lncRNA"; gene_name "DDX11L2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; level 2; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    11869   12227   .       +       .       gene_id "ENSG00000290825.1"; transcript_id "ENST00000456328.2"; gene_type "lncRNA"; gene_name "DDX11L2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    12613   12721   .       +       .       gene_id "ENSG00000290825.1"; transcript_id "ENST00000456328.2"; gene_type "lncRNA"; gene_name "DDX11L2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; exon_number 2; exon_id "ENSE00003582793.1"; level 2; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    13221   14409   .       +       .       gene_id "ENSG00000290825.1"; transcript_id "ENST00000456328.2"; gene_type "lncRNA"; gene_name "DDX11L2"; transcript_type "lncRNA"; transcript_name "DDX11L2-202"; exon_number 3; exon_id "ENSE00002312635.1"; level 2; transcript_support_level "1"; tag "basic"; tag "Ensembl_canonical"; havana_transcript "OTTHUMT00000362751.1";

How can I fix this problem?

Thank you.

tomgutman commented 1 year ago

Hello,

From the first part of your message, it seems you don't use the --oprefix parameter correctly. This parameter require juste a name, not a path. Could you try the same command without this parameter ? Or could you try to go in the folder /public/slst/home/wutao2/sun_data/nf_pipeline/KAPA and run the appropriate command line ?

For the second part of your message I am not sure about the problem. Could you specify the full command line you used ?

Thanks Best Tom

wt12318 commented 1 year ago

I use the updated code, it works. I installed from conda, the error is from the code of version released 1.3.0:

def filterGtf(interval, featuretype):
    """
    function using pybedtools nomenclature to filter transcript_types from gtf file with defined features.
    """
    if interval.attrs['transcript_type'] in featuretype:
        return interval.attrs['transcript_type']
    return False

The updated version is :

def filterGtf(interval, featuretype):
    """
    function using pybedtools nomenclature to filter transcript_types from gtf file with defined features.
    """
    if 'transcript_type' in interval.attrs:
        if interval.attrs['transcript_type'] in featuretype:
            return interval.attrs['transcript_type']
    return False

So I used the code from Github version. Thanks.