AdamaJava / adamajava

qsignature generator accepts gff3 input #247

Closed holmeso closed 3 years ago

holmeso commented 3 years ago

Description

A requirement has been raised to enable qsignature to report on positions listed in a gff3 file. The rationale is that it would be useful to see the coverage that BAM files have in selected genes, which would help determine whether any calls could be made in those genes.

Entries in gff3 files can represent ranges (i.e. gene positional information) as follows:

chr19   ensembl_havana  gene    39687601        39692524        .       +       .       ID=gene:ENSG00000188505;Name=NCCRP1;biotype=protein_coding;description=non-specific cytotoxic cell receptor protein 1 homolog (zebrafish) [Source:HGNC Symbol%3BAcc:33739];gene_id=ENSG00000188505;logic_name=ensembl_havana_gene;version=4
chr9    ensembl_havana  gene    14734664        14910993        .       -       .       ID=gene:ENSG00000164946;Name=FREM1;biotype=protein_coding;description=FRAS1 related extracellular matrix 1 [Source:HGNC Symbol%3BAcc:23399];gene_id=ENSG00000164946;logic_name=ensembl_havana_gene;version=15
chr12   ensembl_havana  gene    25357723        25403870        .       -       .       ID=gene:ENSG00000133703;Name=KRAS;biotype=protein_coding;description=Kirsten rat sarcoma viral oncogene homolog [Source:HGNC
chr8    ensembl_havana  gene    95938200        95961639        .       -       .       ID=gene:ENSG00000164938;Name=TP53INP1;biotype=protein_coding;description=tumor protein p53 inducible nuclear protein 1 [Source:HGNC Symbol%3BAcc:18022];gene_id=ENSG00000164938;logic_name=ensembl_havana_gene;version=9
chr10   ensembl_havana  gene    127512115       127542264       .       +       .       ID=gene:ENSG00000107949;Name=BCCIP;biotype=protein_coding;description=BRCA2 and CDKN1A interacting protein [Source:HGNC Symbol%3BAcc:978];gene_id=ENSG00000107949;logic_name=ensembl_havana_gene;version=12

The SignatureGeneratorBespoke class has been modified to allow a genePositions file to be supplied as an option. When it is supplied, the class examines each entry in the gff3 file and, for each entry, adds one element to a list for every position in the entry's range. For example, for the following entry:

chr19   ensembl_havana  gene    39687601        39692524        .       +       .       ID=gene:ENSG00000188505;Name=NCCRP1;biotype=protein_coding;description=non-specific cytotoxic cell receptor protein 1 homolog (zebrafish) [Source:HGNC

there would be 4923 (39692524 - 39687601) elements added to the list, or 4924 if both endpoints are counted, covering chr19:39687601 to chr19:39692524.
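The expansion described above can be sketched roughly as follows. This is a minimal illustration, not the actual SignatureGeneratorBespoke code: the class name `GenePositionExpander` and the method `expand` are hypothetical, and the sketch assumes tab-delimited gff3 records whose start and end coordinates are both inclusive (the gff3 convention). Under that assumption the NCCRP1 entry yields 4924 elements, one more than the plain difference quoted above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of adamajava.
final class GenePositionExpander {

    /** Expands one tab-delimited gff3 record into "contig:position" entries. */
    static List<String> expand(String gff3Line) {
        String[] cols = gff3Line.split("\t");
        if (cols.length < 5) {
            throw new IllegalArgumentException("Expected at least 5 gff3 columns: " + gff3Line);
        }
        String contig = cols[0].trim();
        int start = Integer.parseInt(cols[3].trim()); // column 4: start, 1-based
        int end = Integer.parseInt(cols[4].trim());   // column 5: end, inclusive
        if (start > end) {
            throw new IllegalArgumentException("Start is after end: " + gff3Line);
        }
        // One list element per base in the range, both endpoints included.
        List<String> positions = new ArrayList<>(end - start + 1);
        for (int pos = start; pos <= end; pos++) {
            positions.add(contig + ":" + pos);
        }
        return positions;
    }
}
```

Holding one element per base is simple but memory-hungry for long genes (FREM1 above spans more than 176,000 positions), so a real implementation might prefer to keep the ranges and iterate over them lazily.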

This results in the output vcf file containing coverage information for each of those positions.

To enable the REF field of the output vcf to be populated (and thus for the vcf file to validate), an option to add the reference fasta file (-reference) has also been added. If a genePositions file is provided without a reference file, an error will be thrown.
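The option dependency described here could be checked along these lines. This is only a sketch under assumptions: the class `SignatureOptionCheck`, the method `validate`, and the exception type are hypothetical and are not taken from the qsignature code base, which may report the error differently.

```java
// Hypothetical option check for illustration only; not part of adamajava.
final class SignatureOptionCheck {

    /**
     * Fails fast when a genePositions file is supplied without a reference
     * fasta, since REF could not then be populated in the output vcf.
     */
    static void validate(String genePositionsFile, String referenceFile) {
        boolean hasGenePositions = genePositionsFile != null && !genePositionsFile.isEmpty();
        boolean hasReference = referenceFile != null && !referenceFile.isEmpty();
        if (hasGenePositions && !hasReference) {
            throw new IllegalArgumentException(
                    "A reference file must be supplied when a genePositions file is used");
        }
    }
}
```

Running the check before any BAM is opened means a missing reference is reported immediately rather than after a long run.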

How Has This Been Tested?

Existing unit tests pass. New code has been tested against genePositions and snpPositions inputs. The Compare class has been run against the generated vcf files.

ChristinaXu2017 commented 3 years ago

Why didn't any unit tests update?

holmeso commented 3 years ago

> Why didn't any unit tests update?

This PR was created as a draft (not ready for review). It looks like you've marked the PR as ready for review. I'll switch it back to draft, and when I'm finished, will revert the status again.

holmeso commented 3 years ago

> SignatureGenerator.java is deprecated, could you please remove deprecated annotation?

I think that class should keep the deprecated annotation because it is still deprecated. There is an issue ( #233 ) dealing with deprecated classes in this package, so it is best to keep this work and the deprecation work separate.