Open hwessels opened 8 years ago
Edited to have the code snippets be code formatted
What version of clipper are you using? I think I fixed the CHRPOS bug in a recent commit. Please use the master branch from the github repo or one of the official encode version.
The --GTF option is something I should have removed a while ago. It was an experimental thing that was too hard to get working properly (GTF files aren't totally standard, so you end up just writing custom parsers for each species)
If you don't have a custom gene list you can just use --species dm3. If you want to use custom regions you can make your own gff file.
Its a standard gff file with 3 attributes
chrM AS_STRUCTURE gene 577 647 . + . gene_id=ENSG00000210049.1;mrna_length=71;premrna_length=71
gene_id is a unique gene id. mrna_length is the length of the mRNA without introns premrna_length is the length of the mRNA with introns.
I'll try to document that more formally somewhere.
Let me know if you've got any questions or if that doesn't work for you!
Could you elaborate on generating the custom .gff file/ adding the attributes? I have a Stringtie-generated custom .gtf transcript list mapped to mm10. Thanks! Matt
this is what my .gtf looks like
chr1 StringTie transcript 4490408 4496454 1000 + . gene_id "MSTRG.13"; transcript_id "MSTRG.13.2"; chr1 StringTie exon 4490408 4496454 1000 + . gene_id "MSTRG.13"; transcript_id "MSTRG.13.2"; exon_number "1"; chr1 StringTie transcript 4490928 4496413 1000 - . gene_id "MSTRG.14"; transcript_id "ENSMUST00000027035"; gene_name "Sox17"; ref_gene_id "ENSMUSG00000025902"; chr1 StringTie exon 4490928 4492668 1000 - . gene_id "MSTRG.14"; transcript_id "ENSMUST00000027035"; exon_number "1"; gene_name "Sox17"; ref_gene_id "ENSMUSG00000025902"; chr1 StringTie exon 4493100 4493466 1000 - . gene_id "MSTRG.14"; transcript_id "ENSMUST00000027035"; exon_number "2"; gene_name "Sox17"; ref_gene_id "ENSMUSG00000025902"; chr1 StringTie exon 4493772 4493863 1000 - . gene_id "MSTRG.14"; transcript_id "ENSMUST00000027035"; exon_number "3"; gene_name "Sox17"; ref_gene_id "ENSMUSG00000025902"; chr1 StringTie exon 4495136 4495942 1000 - . gene_id "MSTRG.14"; transcript_id "ENSMUST00000027035"; exon_number "4"; gene_name "Sox17"; ref_gene_id "ENSMUSG00000025902"; chr1 StringTie exon 4496291 4496413 1000 - . gene_id "MSTRG.14"; transcript_id "ENSMUST00000027035"; exon_number "5"; gene_name "Sox17"; ref_gene_id "ENSMUSG00000025902";
Thanks again
I don't officially support non-standard reference genomes anymore. However if you want you can hack it.
You'll need to make two things. First a XX_exons.bed file.
This file should contain a listing on the exons you want to look at and follows the bed format
chr1 11869 12227 ENSG00000223972.5 0 +
col1: chrom
col2: exon start
col3: exon stop
col4: gene id (used to identify all exons in a gene)
col5: 0 (empty, used for score normally)
col6: strand of exon
The the created file should go into
clipper/clipper/data/regions
You'll also need a XX_AS.STRUCTURE.COMPILED.gff
it follows a fairly standard gff file format
chr19 AS_STRUCTURE gene 68403 69146 . + . gene_id=ENSG00000267111.1;mrna_length=744;premrna_length=744
col1: chrom
col2: AS_STRUCTURE (just leave it as is)
col3: gene (just leave that as is as well, there are only features of type gene in this file)
col4: gene start
col5: gene stop
col6: . (empty)
col7: strand of gene
col8: . (empty)
col9: extra features. It needs to follow format I've got above exactly
gene_id=XXX;mrna_length=XXX;premrna_length=XXX
gene_id is a unique ID for the gene. It needs to match the gene IDs found in the XX_exons.bed file
mrna_length is the length of the predicted mRNA. I sum all exon lengths to get this
premrna_length is the length of the pre-mrna. I calculate this just by doing gene stop - gene start.
Let me know if you've got any questions.
Cheers,
Gabriel Pratt Bioinformatics Graduate Student, Yeo Lab University of California San Diego
On Tue, Jan 17, 2017 at 10:24 AM, mattgeorge28 notifications@github.com wrote:
this is what my .gtf looks like
chr1 StringTie transcript 4490408 4496454 1000 + . gene_id "MSTRG.13"; transcript_id "MSTRG.13.2"; chr1 StringTie exon 4490408 4496454 1000 + . gene_id "MSTRG.13"; transcript_id "MSTRG.13.2"; exon_number "1"; chr1 StringTie transcript 4490928 4496413 1000 - . gene_id "MSTRG.14"; transcript_id "ENSMUST00000027035"; gene_name "Sox17"; ref_gene_id "ENSMUSG00000025902"; chr1 StringTie exon 4490928 4492668 1000 - . gene_id "MSTRG.14"; transcript_id "ENSMUST00000027035"; exon_number "1"; gene_name "Sox17"; ref_gene_id "ENSMUSG00000025902"; chr1 StringTie exon 4493100 4493466 1000 - . gene_id "MSTRG.14"; transcript_id "ENSMUST00000027035"; exon_number "2"; gene_name "Sox17"; ref_gene_id "ENSMUSG00000025902"; chr1 StringTie exon 4493772 4493863 1000 - . gene_id "MSTRG.14"; transcript_id "ENSMUST00000027035"; exon_number "3"; gene_name "Sox17"; ref_gene_id "ENSMUSG00000025902"; chr1 StringTie exon 4495136 4495942 1000 - . gene_id "MSTRG.14"; transcript_id "ENSMUST00000027035"; exon_number "4"; gene_name "Sox17"; ref_gene_id "ENSMUSG00000025902"; chr1 StringTie exon 4496291 4496413 1000 - . gene_id "MSTRG.14"; transcript_id "ENSMUST00000027035"; exon_number "5"; gene_name "Sox17"; ref_gene_id "ENSMUSG00000025902";
Thanks again
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/YeoLab/clipper/issues/39#issuecomment-273254256, or mute the thread https://github.com/notifications/unsubscribe-auth/ABNUW1zGd0838P0AfPd3TomwVfXUloAGks5rTQdUgaJpZM4Jh-36 .
Hi Gabriel, I have these custom files but need a little more clarification as to how to specify them when running clipper. Both files go into clipper/clipper/data/regions? How do I specify them when running clipper? I'm a little confused by how to incorporate this given these options:
--gtfFile=GTFFILE use a gtf file instead of the AS structure data
--customBED=BEDFILE bed file to call peaks on, must come withOUT species
and with customMRNA and customPREMRNA
--customMRNA=FILE file with mRNA lengths for your bed file in format:
GENENAME
Thanks again, Matt
Hi Gabriel, I have these custom files but need a little more clarification as to how to specify them when running clipper. Both files go into clipper/clipper/data/regions? How do I specify them when running clipper? I'm a little confused by how to incorporate this given these options:
--gtfFile=GTFFILE use a gtf file instead of the AS structure data --customBED=BEDFILE bed file to call peaks on, must come withOUT species and with customMRNA and customPREMRNA --customMRNA=FILE file with mRNA lengths for your bed file in format: GENENAMELEN --customPREMRNA=FILE file with pre-mRNA lengths for your bed file in format: GENENAMELEN
Thanks again, Matt
Hello, @mattgeorge28 . Did you achieve your goals? I had the same question with you, but It seems I have to hack author's source code, which is a huge problem for me. THANKS.
You'll need to place these files in the following directories:
data/XX_AS.STRUCTURE.COMPILED.gff data/regions/XX_exons.bed file
Hi
I am trying to run clipper on Drosophila data, providing a GTF instead of a species -s option. (Cou you provide information on how to construct a species file for other model organism)
In a test scenario I ran it on human data, which went thought with thousands of warnings, like:
However, it produces output.
Given i wanna now run it on Drosophila data, i got from the clipper help prompt, that providing a GTF might be sufficient (--gtfFile=GTFFILE use a gtf file instead of the AS structure data) I am not quite sure what AS structure data is. However, I am assuming that clipper reads the GTF file and constructs customBED,customMRNA and customPREMRNA
Given that GTF files can be different from each other, depending from where you download them, what are the GTF requirements?
Or am I doing sty completely wrong here?: