Open hsun3163 opened 2 years ago
This understanding extends the use of the hg_gtf step of the reference preprocessing module to scenarios where the fasta file may not be available. Although we can easily find a placeholder for it, it is still desirable to remove the fasta requirement from the hg_gtf step to keep things tidy.
@hsun3163 we might want to try the same logic as for adding the chr prefix to the VCF file -- first, we somehow determine whether this is needed at all (we don't have this mechanism for now!), then we trigger the workflow by requiring a specific file target.
I'll have to think of this a bit myself -- if you have a solution in your head please run it by me first before you make changes.
At the moment we already have this mechanism: if the chr prefix is already in the gtf, or the fasta file doesn't have a chr prefix, nothing will be added.
For the vcf, chr is not added directly either; it is more like changing the chr name to chr*, so nothing will be added to the chr name if its format is already correct. However, an add_chr suffix will still be added.
The add_chr suffix will not be added if the user doesn't specify the --add-chr option.
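To make the behaviour described above concrete, here is a rough Python sketch. It is only an illustration, not the code in the workflow: the function names `normalize_chrom`, `needs_chr_prefix` and `output_stem` are hypothetical, and the assumption that the add_chr suffix is appended to the output file stem is mine.

```python
def normalize_chrom(name: str) -> str:
    """Rename a chromosome to the chr* form; names already in that form are unchanged."""
    return name if name.startswith("chr") else "chr" + name


def needs_chr_prefix(names) -> bool:
    """Return True if at least one chromosome name lacks the chr prefix."""
    return any(not n.startswith("chr") for n in names)


def output_stem(stem: str, add_chr: bool) -> str:
    """Assumed naming rule: append an add_chr suffix only when --add-chr was given,
    whether or not any chromosome name actually had to be changed."""
    return stem + ".add_chr" if add_chr else stem
```

With this sketch, `normalize_chrom("1")` gives `"chr1"` while `normalize_chrom("chr1")` is returned as is, which matches the "nothing will be added if the format is already correct" behaviour.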
As it turns out, the gene.gtf file provided in our snuc study is exon-based instead of gene-based. This creates a lot of records with duplicate gene IDs but different start-end positions and may lead to an incorrectly labeled bed file.
Therefore, from a workflow perspective, the gtf needs to be preprocessed via the collapsed gene module.
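To illustrate why exon-based records are a problem, here is a minimal Python sketch that merges records sharing a gene ID into a single span. This is not what the collapsed gene module does internally; `collapse_by_gene` and the `(chrom, start, end, gene_id)` tuple layout are hypothetical and only show the duplicate gene ID issue.

```python
def collapse_by_gene(records):
    """Merge records with the same gene ID into one (chrom, start, end, gene_id) span.

    `records` is an iterable of (chrom, start, end, gene_id) tuples, e.g. one per exon.
    The merged span runs from the smallest start to the largest end seen for that gene.
    """
    spans = {}
    for chrom, start, end, gene_id in records:
        if gene_id in spans:
            c, s, e = spans[gene_id]
            spans[gene_id] = (c, min(s, start), max(e, end))
        else:
            spans[gene_id] = (chrom, start, end)
    # dicts keep insertion order, so genes come out in first-seen order
    return [(c, s, e, g) for g, (c, s, e) in spans.items()]


exons = [
    ("chr1", 100, 200, "G1"),
    ("chr1", 300, 400, "G1"),  # same gene ID, different start-end
    ("chr1", 500, 600, "G2"),
]
genes = collapse_by_gene(exons)
```

Without a step like this, each exon row would end up as its own entry in the bed file under the same gene ID.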