cumc / xqtl-protocol

Molecular QTL analysis protocol developed by ADSP Functional Genomics Consortium
https://cumc.github.io/xqtl-protocol/
MIT License

Mandatory for GTF to be gene-collapsed before entering the annotation_coord for phenotype file #170

Open hsun3163 opened 2 years ago

hsun3163 commented 2 years ago

As it turns out, the gene.gtf file provided in our snuc study is exon-based instead of gene-based. This creates a lot of records with duplicate gene IDs but different start-end positions, and may lead to an incorrectly labeled BED file.

Therefore, from a workflow perspective, the gtf needs to be preprocessed via the collapsed gene module.
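
To illustrate the problem, here is a minimal sketch of how duplicated gene IDs in an exon-based GTF could be detected and reduced to one span per gene. It is not the protocol's collapsed gene module, only a simplified stand-in: the function names are hypothetical, and it assumes a standard 9-column tab-separated GTF with a `gene_id "..."` attribute, taking the minimum start and maximum end per gene.

```python
import re
from collections import defaultdict

# Assumption: attributes follow the usual GTF form, e.g. gene_id "ENSG...";
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def gene_spans(gtf_lines):
    """Collect every (chrom, start, end) record seen for each gene_id."""
    spans = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record
        match = GENE_ID_RE.search(fields[8])
        if match is None:
            continue
        spans[match.group(1)].append((fields[0], int(fields[3]), int(fields[4])))
    return spans


def duplicated_gene_ids(gtf_lines):
    """Gene IDs that occur with more than one distinct position."""
    return sorted(g for g, recs in gene_spans(gtf_lines).items() if len(set(recs)) > 1)


def collapse_to_gene_span(gtf_lines):
    """Reduce each gene to a single (chrom, min start, max end) record."""
    collapsed = {}
    for gene_id, recs in gene_spans(gtf_lines).items():
        chrom = recs[0][0]  # simplification: trust the first record's chromosome
        collapsed[gene_id] = (
            chrom,
            min(start for _, start, _ in recs),
            max(end for _, _, end in recs),
        )
    return collapsed
```

If `duplicated_gene_ids` returns anything for a given file, that file is a candidate for collapsing before it is used to annotate phenotype coordinates.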

hsun3163 commented 2 years ago

This understanding expands the use case for the hg_gtf step of the reference preprocessing module to situations where the FASTA file may not be available. Although we can easily find a placeholder for it, it is still desirable to remove the FASTA requirement from the hg_gtf step to keep things tidy.

gaow commented 2 years ago

@hsun3163 we might want to try the same logic as for adding the chr prefix to the VCF file: first, we somehow determine whether this is needed at all (we don't have this mechanism for now!), then we trigger the workflow by requiring a specific file target.

I'll have to think of this a bit myself -- if you have a solution in your head please run it by me first before you make changes.
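
One possible shape for the "determine, then require a target" idea is sketched below. This is only an assumption about how it could look, not an agreed design: `needs_collapse`, `required_target`, and the `.collapsed` file-name convention are all hypothetical.

```python
from collections import Counter
from pathlib import Path


def needs_collapse(gene_ids):
    """True if any gene ID appears on more than one record."""
    return any(count > 1 for count in Counter(gene_ids).values())


def required_target(gtf_path, collapse):
    """Return the file the downstream step should depend on.

    Hypothetical naming: gene.gtf -> gene.collapsed.gtf when collapsing
    is needed; otherwise the input file itself is the target.
    """
    path = Path(gtf_path)
    if not collapse:
        return path
    return path.with_name(f"{path.stem}.collapsed{path.suffix}")
```

With this arrangement, the collapse step would only run when the downstream step asks for the collapsed file, which mirrors the target-driven logic described above.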

hsun3163 commented 2 years ago

At the moment we already have this mechanism: if the chr prefix is already in the GTF, or the FASTA file doesn't have a chr prefix, nothing will be added.

For the VCF, chr is not added directly either; it is more like renaming the chromosome to chr*, so nothing is added to the chromosome name if its format is already correct. However, an add_chr suffix will still be added.

The add_chr suffix will not be added if the user doesn't specify the --add-chr option.
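
The behavior described above can be summarized in a small sketch. This is a paraphrase of the rules in this comment, not the module's actual code: the function names are hypothetical, and the `.add_chr` separator in the output name is an assumption.

```python
def normalize_chrom(name):
    """Rename a chromosome to the chr* form; correct names are left unchanged."""
    return name if name.startswith("chr") else "chr" + name


def gtf_needs_chr_prefix(gtf_chroms, fasta_chroms):
    """The prefix is only added when the FASTA uses chr* and the GTF does not."""
    gtf_has_prefix = all(c.startswith("chr") for c in gtf_chroms)
    fasta_has_prefix = all(c.startswith("chr") for c in fasta_chroms)
    return fasta_has_prefix and not gtf_has_prefix


def output_name(stem, add_chr):
    """The add_chr suffix depends only on the --add-chr option, not on the data.

    Assumption: the suffix is joined to the stem with a dot.
    """
    return f"{stem}.add_chr" if add_chr else stem
```

Note that `output_name` gains the suffix whenever the option is set, even if `normalize_chrom` changed nothing, which matches the VCF case described above.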