cumc / xqtl-protocol

Molecular QTL analysis protocol developed by ADSP Functional Genomics Consortium
https://cumc.github.io/xqtl-protocol/
MIT License
38 stars 42 forks source link

Duplicate entries in leafcutter output. #418

Closed hsun3163 closed 2 years ago

hsun3163 commented 2 years ago

In the output of leafcutter for ROSMAP, ~600 out of ~200000 of the rows are duplicated, exemplified by

[skandoi@scc4 batch_all]$ cat /restricted/projectnb/casa/skandoi/ROSMAP_DLPFC/leafcutter_output/batch_all/leafcutter_bam_intron_usage_perind.counts.gz_raw_data.qqnorm.phenotype_group.txt | grep "chr3:125766818:125848253:clu_64950_+:ENSG000002846"
chr3:125766818:125848253:clu_64950_+:ENSG00000284624    ENSG00000284624
chr3:125766818:125848253:clu_64950_+:ENSG00000284660    ENSG00000284660

However, this behavior is not observed in our MWE and protocol data

hs3163@node45:/mnt/vast/hpc/csg/xqtl_workflow_testing/finalizing/output/leaf_cutter$ zcat xqtl_protocol_data_bam_list_intron_usage_perind.counts.gz_raw_data.qqnorm.formated.bed.gz | cut -f4 | s
ort | uniq -d

hs3163@node45:/mnt/vast/hpc/csg/xqtl_workflow_testing/finalizing/output/leaf_cutter$  zcat /mnt/mfs/hgrcgrid/homes/xc2610/final_MWE/leafcutter_output/sample_fastq_bam_list_intron_usage_perind.counts.gz_raw_data.qqnorm.formated.bed.gz | cut -f4 | sort | uniq -d
hsun3163 commented 2 years ago

It is unclear which step caused this issue, but to avoid it. a remove duplicate step will be added at the end of leafcutter annotation.

gaow commented 2 years ago

It looks like the same genomic region is annotated to two genes? We should figure out why it happened in the first place then prevent it from happening (not removing it). This gene is a good example to check.

gaow commented 2 years ago

The problem with removing is which gene would you remove? You can imagine it could cause serious issues when you randomly remove the "wrong" gene.

hsun3163 commented 2 years ago

The problem with removing is which gene would you remove? You can imagine it could cause serious issues when you randomly remove the "wrong" gene.

They are exactly the same line, so I think it is safe to drop it. I will do the processing via remote control and further investigate the issue myself after the sqtl was produced.

gaow commented 2 years ago

Sorry I'm confused ... these lines are not the same line. It's the same event annotated to different genes and I wonder why that happened in the first place

chr3:125766818:125848253:clu_64950_+:ENSG00000284624    ENSG00000284624
chr3:125766818:125848253:clu_64950_+:ENSG00000284660    ENSG00000284660
hsun3163 commented 2 years ago

Sorry I'm confused ... these lines are not the same line. It's the same event annotated to different genes and I wonder why that happened in the first place

chr3:125766818:125848253:clu_64950_+:ENSG00000284624    ENSG00000284624
chr3:125766818:125848253:clu_64950_+:ENSG00000284660    ENSG00000284660

I just noticed this. Cuz they are produced by grep "chr3:125766818:125848253:clu_64950_+:ENSG00000284624" so I didn't noticed guess the : produce lead to some confusion for the grep

gaow commented 2 years ago

To clarify, what you will remove are exact duplicates (as implemented in #419) and not cases such as the above? And we still don't know why duplicates occur in the first place?

hsun3163 commented 2 years ago

As it turns out, this ticket is a misunderstanding regarding the nature of the issue at hand. Will open up a new one when more light is shed on.