daler / gffutils

GFF and GTF file manipulation and interconversion
http://daler.github.io/gffutils
MIT License

add a function to create splice sites similar to create_introns #220

Closed Juke34 closed 1 year ago

Juke34 commented 1 year ago

I needed a way to create splice site features easily, so I propose a create_splice_sites function modeled on the create_introns function. It may not be optimal, because it loops over the interfeatures function twice (once for the left splice site and once for the right splice site), but I didn't manage to make it work any other way. The loop occurs before:

for child in child_gen():
    exons = self.children(
        child, level=1, featuretype=exon_featuretype, order_by="start"
    )

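To make the left/right distinction concrete, here is a minimal sketch of how the two splice sites of a single intron could be derived from its coordinates. It is not the code from the pull request: the function name `splice_sites` is hypothetical, and it assumes a splice site is the two intronic bases at each end of the intron, which matches the expected output given later in this thread.

```python
def splice_sites(intron_start, intron_end, strand):
    """Return (featuretype, start, end) for the two splice sites of one intron.

    Coordinates are 1-based and inclusive, as in GFF.
    Assumption: each splice site is the outermost two bases of the intron.
    """
    left = (intron_start, intron_start + 1)  # first two bases of the intron
    right = (intron_end - 1, intron_end)     # last two bases of the intron
    if strand == "-":
        # On the minus strand the transcript reads right to left, so the
        # 5' (donor) site sits at the higher coordinate.
        five, three = right, left
    else:
        five, three = left, right
    return [
        ("five_prime_cis_splice_site", *five),
        ("three_prime_cis_splice_site", *three),
    ]


# Intron between exon3 (4764517-4764597) and exon2 (4767606-4767729)
# of ENSMUST00000045689, which lies on the minus strand.
sites = splice_sites(4764598, 4767605, "-")
```

On the minus strand the left end of the intron is therefore the three_prime_cis_splice_site and the right end is the five_prime_cis_splice_site.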
daler commented 1 year ago

Thanks. Can you add tests (at the bottom of gffutils/test/test.py) for this? You can run the tests locally with pytest. Still working out how to approve GitHub Actions to run on PRs from a fork, so this is not happening automatically.

Juke34 commented 1 year ago

Hi, I didn't manage to write a proper test; I'm not proficient enough in pytest. I wanted to use test/data/gff_example1.gff3 as input. The expected output is:

chr1    ensGene gene    4763287 4775820 .   -   .   Name=ENSMUSG00000033845;ID=ENSMUSG00000033845;Alias=ENSMUSG00000033845;gid=ENSMUSG00000033845
chr1    ensGene mRNA    4764517 4775779 .   -   .   Name=ENSMUST00000045689;Parent=ENSMUSG00000033845;ID=ENSMUST00000045689;Alias=ENSMUSG00000033845;gid=ENSMUSG00000033845
chr1    ensGene CDS 4775654 4775758 .   -   0   Name=ENSMUST00000045689.cds0;Parent=ENSMUST00000045689;ID=ENSMUST00000045689.cds0;gid=ENSMUSG00000033845
chr1    ensGene CDS 4772761 4772814 .   -   0   Name=ENSMUST00000045689.cds1;Parent=ENSMUST00000045689;ID=ENSMUST00000045689.cds1;gid=ENSMUSG00000033845
chr1    ensGene exon    4775654 4775779 .   -   .   Name=ENSMUST00000045689.exon0;Parent=ENSMUST00000045689;ID=ENSMUST00000045689.exon0;gid=ENSMUSG00000033845
chr1    ensGene exon    4772649 4772814 .   -   .   Name=ENSMUST00000045689.exon1;Parent=ENSMUST00000045689;ID=ENSMUST00000045689.exon1;gid=ENSMUSG00000033845
chr1    ensGene exon    4767606 4767729 .   -   .   Name=ENSMUST00000045689.exon2;Parent=ENSMUST00000045689;ID=ENSMUST00000045689.exon2;gid=ENSMUSG00000033845
chr1    ensGene exon    4764517 4764597 .   -   .   Name=ENSMUST00000045689.exon3;Parent=ENSMUST00000045689;ID=ENSMUST00000045689.exon3;gid=ENSMUSG00000033845
chr1    ensGene five_prime_UTR  4775759 4775779 .   -   .   Name=ENSMUST00000045689.utr0;Parent=ENSMUST00000045689;ID=ENSMUST00000045689.utr0;gid=ENSMUSG00000033845
chr1    ensGene three_prime_UTR 4772649 4772760 .   -   .   Name=ENSMUST00000045689.utr1;Parent=ENSMUST00000045689;ID=ENSMUST00000045689.utr1;gid=ENSMUSG00000033845
chr1    ensGene three_prime_UTR 4767606 4767729 .   -   .   Name=ENSMUST00000045689.utr2;Parent=ENSMUST00000045689;ID=ENSMUST00000045689.utr2;gid=ENSMUSG00000033845
chr1    ensGene three_prime_UTR 4764517 4764597 .   -   .   Name=ENSMUST00000045689.utr3;Parent=ENSMUST00000045689;ID=ENSMUST00000045689.utr3;gid=ENSMUSG00000033845
chr1    gffutils_derived    three_prime_cis_splice_site 4764598 4764599 .   -   .   Name=ENSMUST00000045689.exon2,ENSMUST00000045689.exon3;Parent=ENSMUST00000045689;ID=three_prime_cis_splice_site_ENSMUST00000045689.exon2-ENSMUST00000045689.exon3;gid=ENSMUSG00000033845
chr1    gffutils_derived    three_prime_cis_splice_site 4767730 4767731 .   -   .   Name=ENSMUST00000045689.exon1,ENSMUST00000045689.exon2;Parent=ENSMUST00000045689;ID=three_prime_cis_splice_site_ENSMUST00000045689.exon1-ENSMUST00000045689.exon2;gid=ENSMUSG00000033845
chr1    gffutils_derived    three_prime_cis_splice_site 4772815 4772816 .   -   .   Name=ENSMUST00000045689.exon0,ENSMUST00000045689.exon1;Parent=ENSMUST00000045689;ID=three_prime_cis_splice_site_ENSMUST00000045689.exon0-ENSMUST00000045689.exon1;gid=ENSMUSG00000033845
chr1    gffutils_derived    five_prime_cis_splice_site  4767604 4767605 .   -   .   Name=ENSMUST00000045689.exon2,ENSMUST00000045689.exon3;Parent=ENSMUST00000045689;ID=five_prime_cis_splice_site_ENSMUST00000045689.exon2-ENSMUST00000045689.exon3;gid=ENSMUSG00000033845
chr1    gffutils_derived    five_prime_cis_splice_site  4772647 4772648 .   -   .   Name=ENSMUST00000045689.exon1,ENSMUST00000045689.exon2;Parent=ENSMUST00000045689;ID=five_prime_cis_splice_site_ENSMUST00000045689.exon1-ENSMUST00000045689.exon2;gid=ENSMUSG00000033845
chr1    gffutils_derived    five_prime_cis_splice_site  4775652 4775653 .   -   .   Name=ENSMUST00000045689.exon0,ENSMUST00000045689.exon1;Parent=ENSMUST00000045689;ID=five_prime_cis_splice_site_ENSMUST00000045689.exon0-ENSMUST00000045689.exon1;gid=ENSMUSG00000033845

The features of interest are the last 6 lines.

I couldn't find any existing test for create_introns to use as a model. Could you help?
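One way such a test could be phrased is to compare only the columns that matter (feature type, start, end, strand) rather than whole lines, so that attribute ordering does not affect the result. The sketch below is an assumption about how to structure the comparison, not the test that ended up in the repository; the helper name `feature_coords` is hypothetical, and in a real test the lines would come from the features produced by create_splice_sites.

```python
def feature_coords(gff_lines, source="gffutils_derived"):
    """Return sorted (featuretype, start, end, strand) for lines from `source`.

    Lines are split on whitespace into at most 9 fields, so the attributes
    column is kept whole even if it contains spaces.
    """
    coords = []
    for line in gff_lines:
        fields = line.split(None, 8)
        if len(fields) < 9 or fields[1] != source:
            continue  # skip blank lines and features that were not derived
        coords.append((fields[2], int(fields[3]), int(fields[4]), fields[6]))
    return sorted(coords)


# Three lines taken from the expected output above, with the attributes
# column abbreviated for readability.
example_lines = [
    "chr1\tensGene\texon\t4764517\t4764597\t.\t-\t.\tParent=ENSMUST00000045689",
    "chr1\tgffutils_derived\tthree_prime_cis_splice_site\t4764598\t4764599\t.\t-\t.\tParent=ENSMUST00000045689",
    "chr1\tgffutils_derived\tfive_prime_cis_splice_site\t4767604\t4767605\t.\t-\t.\tParent=ENSMUST00000045689",
]
observed = feature_coords(example_lines)
```

A pytest test function would then assert that `observed` equals the hand-written list of the six expected tuples.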

daler commented 1 year ago

I bet you weren't doing anything wrong -- I just found out that gffutils/test/test.py has not been running since porting tests from nosetests to pytest! Renaming the file did the trick. I added the new test in there, and I'll merge into 0.12rc branch.

daler commented 1 year ago

No need for this now, but I wonder if iterating over the exons in pairs (e.g. with itertools.pairwise or similar) would make this and create_introns more efficient.
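A rough illustration of the pairwise idea, using plain (start, end) tuples instead of gffutils features: each adjacent pair of exons yields both splice sites in one pass. The function name `splice_sites_one_pass` is hypothetical, and this is only a sketch of the iteration pattern, not a drop-in replacement for create_introns or create_splice_sites.

```python
try:
    from itertools import pairwise  # available in Python 3.10+
except ImportError:
    from itertools import tee

    def pairwise(iterable):
        """Fallback: yield overlapping pairs (a, b), (b, c), ..."""
        a, b = tee(iterable)
        next(b, None)
        return zip(a, b)


def splice_sites_one_pass(exons, strand):
    """Yield (featuretype, start, end) for every intron in a single pass.

    `exons` is a sequence of (start, end) tuples sorted by start,
    with 1-based inclusive coordinates.
    """
    if strand == "-":
        left_type, right_type = (
            "three_prime_cis_splice_site",
            "five_prime_cis_splice_site",
        )
    else:
        left_type, right_type = (
            "five_prime_cis_splice_site",
            "three_prime_cis_splice_site",
        )
    for (_, left_exon_end), (right_exon_start, _) in pairwise(exons):
        intron_start = left_exon_end + 1
        intron_end = right_exon_start - 1
        if intron_end < intron_start:
            continue  # adjacent or overlapping exons: no intron between them
        yield (left_type, intron_start, intron_start + 1)
        yield (right_type, intron_end - 1, intron_end)


# Exons of ENSMUST00000045689 (minus strand), ordered by start.
exons = [
    (4764517, 4764597),
    (4767606, 4767729),
    (4772649, 4772814),
    (4775654, 4775779),
]
sites = list(splice_sites_one_pass(exons, "-"))
```

For these four exons the generator produces the same six coordinate pairs as the expected output earlier in the thread.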