YeoLab / clipper

A tool to identify CLIP-seq peaks
Other
64 stars 41 forks source link

clipper execution error #39

Open hwessels opened 8 years ago

hwessels commented 8 years ago

Hi

I am trying to run clipper on Drosophila data, providing a GTF instead of a species -s option. (Cou you provide information on how to construct a species file for other model organism)

In a test scenario I ran it on human data, which went thought with thousands of warnings, like:

WARNING:py.warnings:/gnu/store/2wsbia11jiyppc4x18w8ff8nvwjhp5na-clipper-0.3.0/lib/python2.7/site-packages/clipper-0.2.0-py2.7-linux-x86_64.egg/clipper/src/call_peak.py:597: VisibleDeprecationWarning: boolean index did not match indexed array along dimension 0; dimension is 224 but corresponding boolean dimension is 9
  peak_center = [x + peak_start for x in self.xRange[self.find_local_maxima(spline_values[peak_start:(peak_stop + 1)])]]

However, it produces output.

Given i wanna now run it on Drosophila data, i got from the clipper help prompt, that providing a GTF might be sufficient (--gtfFile=GTFFILE use a gtf file instead of the AS structure data) I am not quite sure what AS structure data is. However, I am assuming that clipper reads the GTF file and constructs customBED,customMRNA and customPREMRNA

Given that GTF files can be different from each other, depending from where you download them, what are the GTF requirements?

Or am I doing sty completely wrong here?:

$ clipper -b ../INPUT.filtered.bam --gtfFile /path/to/BDGP6_Ensembl_release81/GTF/Drosophila_melanogaster.BDGP6.81.gtf -o test
  
Traceback (most recent call last): 
  File "/gnu/store/2wsbia11jiyppc4x18w8ff8nvwjhp5na-clipper-0. 3.0/bin/.clipper-real", line 9, in  
    load_entry_point('clipper==0.2.0', 'console_scripts', 'clipper')() 
  File "/gnu/store/2wsbia11jiyppc4x18w8ff8nvwjhp5na-clipper-0. 3.0/lib/python2.7/site-packages/clipper-0.2.0-py2.7-linux-x86 _64.egg/clipper/src/peakfinder.py", line 687, in call_main 
    main(options) 
  File "/gnu/store/2wsbia11jiyppc4x18w8ff8nvwjhp5na-clipper-0. 3.0/lib/python2.7/site-packages/clipper-0.2.0-py2.7-linux-x86 _64.egg/clipper/src/peakfinder.py", line 544, in main 
    gene_tool = build_transcript_data_gtf(pybedtools.BedTool(options.gtfFile ), options.premRNA).saveas() 
  File "/gnu/store/2wsbia11jiyppc4x18w8ff8nvwjhp5na-clipper-0. 3.0/lib/python2.7/site-packages/clipper-0.2.0-py2.7-linux-x86 _64.egg/clipper/src/peakfinder.py", line 255, in build_transcript_data_gtf 
    "gene_id=" + gene['gene_id'] + "; transcript_id=" + gene['transcript_id'] + "; effective_length=" + str(effective_length)])) 
  File "pybedtools/cbedtools.pyx", line 500, in pybedtools.cbedtools.create_interval_from_list (pybedtools/cbedtools.cpp:8145) 
  File "pybedtools/cbedtools.pyx", line 562, in pybedtools.cbedtools.create_interval_from_list (pybedtools/cbedtools.cpp:7945) 
OverflowError: can't convert negative value to CHRPOS 
olgabot commented 8 years ago

Edited to have the code snippets be code formatted

gpratt commented 8 years ago

What version of clipper are you using? I think I fixed the CHRPOS bug in a recent commit. Please use the master branch from the github repo or one of the official encode version.

The --GTF option is something I should have removed a while ago. It was an experimental thing that was too hard to get working properly (GTF files aren't totally standard, so you end up just writing custom parsers for each species)

If you don't have a custom gene list you can just use --species dm3. If you want to use custom regions you can make your own gff file.

Its a standard gff file with 3 attributes

chrM AS_STRUCTURE gene 577 647 . + . gene_id=ENSG00000210049.1;mrna_length=71;premrna_length=71

gene_id is a unique gene id. mrna_length is the length of the mRNA without introns premrna_length is the length of the mRNA with introns.

I'll try to document that more formally somewhere.

Let me know if you've got any questions or if that doesn't work for you!

mattgeorge28 commented 7 years ago

Could you elaborate on generating the custom .gff file/ adding the attributes? I have a Stringtie-generated custom .gtf transcript list mapped to mm10. Thanks! Matt

mattgeorge28 commented 7 years ago

this is what my .gtf looks like

chr1 StringTie transcript 4490408 4496454 1000 + . gene_id "MSTRG.13"; transcript_id "MSTRG.13.2"; chr1 StringTie exon 4490408 4496454 1000 + . gene_id "MSTRG.13"; transcript_id "MSTRG.13.2"; exon_number "1"; chr1 StringTie transcript 4490928 4496413 1000 - . gene_id "MSTRG.14"; transcript_id "ENSMUST00000027035"; gene_name "Sox17"; ref_gene_id "ENSMUSG00000025902"; chr1 StringTie exon 4490928 4492668 1000 - . gene_id "MSTRG.14"; transcript_id "ENSMUST00000027035"; exon_number "1"; gene_name "Sox17"; ref_gene_id "ENSMUSG00000025902"; chr1 StringTie exon 4493100 4493466 1000 - . gene_id "MSTRG.14"; transcript_id "ENSMUST00000027035"; exon_number "2"; gene_name "Sox17"; ref_gene_id "ENSMUSG00000025902"; chr1 StringTie exon 4493772 4493863 1000 - . gene_id "MSTRG.14"; transcript_id "ENSMUST00000027035"; exon_number "3"; gene_name "Sox17"; ref_gene_id "ENSMUSG00000025902"; chr1 StringTie exon 4495136 4495942 1000 - . gene_id "MSTRG.14"; transcript_id "ENSMUST00000027035"; exon_number "4"; gene_name "Sox17"; ref_gene_id "ENSMUSG00000025902"; chr1 StringTie exon 4496291 4496413 1000 - . gene_id "MSTRG.14"; transcript_id "ENSMUST00000027035"; exon_number "5"; gene_name "Sox17"; ref_gene_id "ENSMUSG00000025902";

Thanks again

gpratt commented 7 years ago

I don't officially support non-standard reference genomes anymore. However if you want you can hack it.

You'll need to make two things. First a XX_exons.bed file.

This file should contain a listing on the exons you want to look at and follows the bed format

chr1 11869 12227 ENSG00000223972.5 0 +

col1: chrom

col2: exon start

col3: exon stop

col4: gene id (used to identify all exons in a gene)

col5: 0 (empty, used for score normally)

col6: strand of exon

The the created file should go into

clipper/clipper/data/regions

You'll also need a XX_AS.STRUCTURE.COMPILED.gff

it follows a fairly standard gff file format

chr19 AS_STRUCTURE gene 68403 69146 . + . gene_id=ENSG00000267111.1;mrna_length=744;premrna_length=744

col1: chrom

col2: AS_STRUCTURE (just leave it as is)

col3: gene (just leave that as is as well, there are only features of type gene in this file)

col4: gene start

col5: gene stop

col6: . (empty)

col7: strand of gene

col8: . (empty)

col9: extra features. It needs to follow format I've got above exactly

gene_id=XXX;mrna_length=XXX;premrna_length=XXX

gene_id is a unique ID for the gene. It needs to match the gene IDs found in the XX_exons.bed file

mrna_length is the length of the predicted mRNA. I sum all exon lengths to get this

premrna_length is the length of the pre-mrna. I calculate this just by doing gene stop - gene start.

Let me know if you've got any questions.

Cheers,

Gabriel Pratt Bioinformatics Graduate Student, Yeo Lab University of California San Diego

On Tue, Jan 17, 2017 at 10:24 AM, mattgeorge28 notifications@github.com wrote:

this is what my .gtf looks like

chr1 StringTie transcript 4490408 4496454 1000 + . gene_id "MSTRG.13"; transcript_id "MSTRG.13.2"; chr1 StringTie exon 4490408 4496454 1000 + . gene_id "MSTRG.13"; transcript_id "MSTRG.13.2"; exon_number "1"; chr1 StringTie transcript 4490928 4496413 1000 - . gene_id "MSTRG.14"; transcript_id "ENSMUST00000027035"; gene_name "Sox17"; ref_gene_id "ENSMUSG00000025902"; chr1 StringTie exon 4490928 4492668 1000 - . gene_id "MSTRG.14"; transcript_id "ENSMUST00000027035"; exon_number "1"; gene_name "Sox17"; ref_gene_id "ENSMUSG00000025902"; chr1 StringTie exon 4493100 4493466 1000 - . gene_id "MSTRG.14"; transcript_id "ENSMUST00000027035"; exon_number "2"; gene_name "Sox17"; ref_gene_id "ENSMUSG00000025902"; chr1 StringTie exon 4493772 4493863 1000 - . gene_id "MSTRG.14"; transcript_id "ENSMUST00000027035"; exon_number "3"; gene_name "Sox17"; ref_gene_id "ENSMUSG00000025902"; chr1 StringTie exon 4495136 4495942 1000 - . gene_id "MSTRG.14"; transcript_id "ENSMUST00000027035"; exon_number "4"; gene_name "Sox17"; ref_gene_id "ENSMUSG00000025902"; chr1 StringTie exon 4496291 4496413 1000 - . gene_id "MSTRG.14"; transcript_id "ENSMUST00000027035"; exon_number "5"; gene_name "Sox17"; ref_gene_id "ENSMUSG00000025902";

Thanks again

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/YeoLab/clipper/issues/39#issuecomment-273254256, or mute the thread https://github.com/notifications/unsubscribe-auth/ABNUW1zGd0838P0AfPd3TomwVfXUloAGks5rTQdUgaJpZM4Jh-36 .

mattgeorge28 commented 7 years ago

Hi Gabriel, I have these custom files but need a little more clarification as to how to specify them when running clipper. Both files go into clipper/clipper/data/regions? How do I specify them when running clipper? I'm a little confused by how to incorporate this given these options:

--gtfFile=GTFFILE use a gtf file instead of the AS structure data --customBED=BEDFILE bed file to call peaks on, must come withOUT species and with customMRNA and customPREMRNA --customMRNA=FILE file with mRNA lengths for your bed file in format: GENENAMELEN --customPREMRNA=FILE file with pre-mRNA lengths for your bed file in format: GENENAMELEN

Thanks again, Matt

gnilihzeux commented 5 years ago

Hi Gabriel, I have these custom files but need a little more clarification as to how to specify them when running clipper. Both files go into clipper/clipper/data/regions? How do I specify them when running clipper? I'm a little confused by how to incorporate this given these options:

--gtfFile=GTFFILE use a gtf file instead of the AS structure data --customBED=BEDFILE bed file to call peaks on, must come withOUT species and with customMRNA and customPREMRNA --customMRNA=FILE file with mRNA lengths for your bed file in format: GENENAMELEN --customPREMRNA=FILE file with pre-mRNA lengths for your bed file in format: GENENAMELEN

Thanks again, Matt

Hello, @mattgeorge28 . Did you achieve your goals? I had the same question with you, but It seems I have to hack author's source code, which is a huge problem for me. THANKS.

byee4 commented 5 years ago

You'll need to place these files in the following directories:

data/XX_AS.STRUCTURE.COMPILED.gff data/regions/XX_exons.bed file