cumc / xqtl-protocol

Molecular QTL analysis protocol developed by ADSP Functional Genomics Consortium
https://cumc.github.io/xqtl-protocol/
MIT License
38 stars 42 forks source link

update regional data generation in cis workhorse #872

Closed rfeng2023 closed 6 months ago

rfeng2023 commented 6 months ago

@gaow

  1. for now with 4 conditions, accumulated_pheno_df is like below,
    
    #chr    start   end ID  path    Original_ID cov_path    cond
    0   chr19   44905790    44905791    ENSG00000130203 /mnt/vast/hpc/homes/zq2209/work/input/ROSMAP_p...   chr19_ENSG00000130203_P02649    /mnt/vast/hpc/homes/zq2209/work/input/ROSMAP_p...   DLPFC_Bennett_pQTL
    1   chr19   44905791    44909393    ENSG00000130203 /mnt/vast/hpc/homes/ss7112/QTL_Analysis/Glycop...   gp_1235_P02649  /mnt/vast/hpc/homes/ss7112/QTL_Analysis/Glycop...   DLPFC_Klein_gpQTL_adjusted
    2   chr19   44905791    44909393    ENSG00000130203 /mnt/vast/hpc/homes/ss7112/QTL_Analysis/Glycop...   gp_1236_P02649  /mnt/vast/hpc/homes/ss7112/QTL_Analysis/Glycop...   DLPFC_Klein_gpQTL_adjusted
    3   chr19   44905791    44909393    ENSG00000130203 /mnt/vast/hpc/homes/ss7112/QTL_Analysis/GP_Pro...   gp_1235_P02649  /mnt/vast/hpc/homes/ss7112/QTL_Analysis/GP_Pro...   DLPFC_Klein_gpQTL_unadjusted
    4   chr19   44905791    44909393    ENSG00000130203 /mnt/vast/hpc/homes/ss7112/QTL_Analysis/GP_Pro...   gp_1236_P02649  /mnt/vast/hpc/homes/ss7112/QTL_Analysis/GP_Pro...   DLPFC_Klein_gpQTL_unadjusted
    5   chr19   44905790    44905791    ENSG00000130203 /home/al4225/xqtl_data/ROSMAP/DLPFC/monocyte/o...   ENSG00000130203 /home/al4225/xqtl_data/ROSMAP/DLPFC/monocyte/o...   Monocyte_ROSMAP_eQTL
with your previous codes, `meta_data` would show something like this:
#chr    ID  cond    cov_path    end path    start   Original_ID

0 chr19 ENSG00000130203 DLPFC_Bennett_pQTL /mnt/vast/hpc/homes/zq2209/work/input/ROSMAP_p... 44905791 /mnt/vast/hpc/homes/zq2209/work/input/ROSMAP_p... 44905790 chr19_ENSG00000130203_P02649 1 chr19 ENSG00000130203 DLPFC_Klein_gpQTL_adjusted,DLPFC_Klein_gpQTL_u... /mnt/vast/hpc/homes/ss7112/QTL_Analysis/Glycop... 44909393 /mnt/vast/hpc/homes/ss7112/QTL_Analysis/Glycop... 44905791 gp_1235_P02649,gp_1236_P02649 2 chr19 ENSG00000130203 Monocyte_ROSMAP_eQTL /home/al4225/xqtl_data/ROSMAP/DLPFC/monocyte/o... 44905791 /home/al4225/xqtl_data/ROSMAP/DLPFC/monocyte/o... 44905790 ENSG00000130203



 the paths are not aggregated, I am wondering why we not just paste each path and original ID one by one through `join`, so I defined one function `merge_rows`, and `accumulated_pheno_df` is processed with that directly. 

2. I've added the code to detect the empty_indices without testing yet, please review it and please let @Bale4 test it. 
review-notebook-app[bot] commented 6 months ago

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

gaow commented 6 months ago

I am wondering why we not just paste each path and original ID one by one through join, so I defined one function merge_rows

@rfeng2023 this is what i did before but you removed it.

I think using agg is more elegant, if it works? I wonder why it does not work.

gaow commented 6 months ago

For this line,

    accumulated_pheno_df = accumulated_pheno_df.accumulated_pheno_df.groupby("ID").apply(merge_rows).reset_index()

I would feel much more comfortable if we group it by chr, start, end, ID not just ID. In fact, can there be cases where different isoforms of the same gene has different coordinates? If so, when we aggregate by the original ID, should we take the minimum of the start position and the maximum of the end? That will still result in two lines:

  1. we group it by chr, start, end, ID, Original_ID and aggregate the paths
  2. we group by chr and ID but aggregate it such that the start is the min() of all the start values, end is the max() of all end values, then Original_ID will be pasted with ,.

Does it make sense? If so could you give it a try and test?

gaow commented 6 months ago

I've added the code to detect the empty_indices without testing yet, please review it and please let @Bale4 test it.

I would prefer a more plain version of the implementation where setdiff is not involved. As i discussed with you today, set by its mathematical definition does not retain order. The actual implementation may be different from one language to another. But it is a behavior we should nto rely on --- it would make a statistician very uncomfortable to use sets in code assuming the order is retained. A simpler version is to keep the exact indices of which results are empty and remove those precise indices.

rfeng2023 commented 6 months ago

For this line,

    accumulated_pheno_df = accumulated_pheno_df.accumulated_pheno_df.groupby("ID").apply(merge_rows).reset_index()

I would feel much more comfortable if we group it by chr, start, end, ID not just ID. In fact, can there be cases where different isoforms of the same gene has different coordinates? If so, when we aggregate by the original ID, should we take the minimum of the start position and the maximum of the end? That will still result in two lines:

  1. we group it by chr, start, end, ID, Original_ID and aggregate the paths
  2. we group by chr and ID but aggregate it such that the start is the min() of all the start values, end is the max() of all end values, then Original_ID will be pasted with ,.

Does it make sense? If so could you give it a try and test?

for1: accumulated_pheno_df.groupby(["#chr", "start", "end", "ID", "Original_ID"]).apply(merge_rows).reset_index(drop=True)the output would be

Original_ID cond    path    cov_path
0   ENSG00000130203 Monocyte_ROSMAP_eQTL    /home/al4225/xqtl_data/ROSMAP/DLPFC/monocyte/o...   /home/al4225/xqtl_data/ROSMAP/DLPFC/monocyte/o...
1   chr19_ENSG00000130203_P02649    DLPFC_Bennett_pQTL  /mnt/vast/hpc/homes/zq2209/work/input/ROSMAP_p...   /mnt/vast/hpc/homes/zq2209/work/input/ROSMAP_p...
2   gp_1235_P02649,gp_1235_P02649   DLPFC_Klein_gpQTL_adjusted,DLPFC_Klein_gpQTL_u...   /mnt/vast/hpc/homes/ss7112/QTL_Analysis/Glycop...   /mnt/vast/hpc/homes/ss7112/QTL_Analysis/Glycop...
3   gp_1236_P02649,gp_1236_P02649   DLPFC_Klein_gpQTL_adjusted,DLPFC_Klein_gpQTL_u...   /mnt/vast/hpc/homes/ss7112/QTL_Analysis/Glycop...   /mnt/vast/hpc/homes/ss7112/QTL_Analysis/Glycop...

for In fact, can there be cases where different isoforms of the same gene has different coordinates? if the coordinates here means region, then the answer is yes, the region of isoforms could be different, however, I think as we discussed before, we are still using the same gene region(extended cis window) here. I think maybe analysts are using different ref here, to cause the different start and end. and them for 2 we group by chr and ID but aggregate it such that the start is the min() of all the start values, end is the max() of all end values, then Original_ID will be pasted with ,. I will change to this way in next commit

rfeng2023 commented 6 months ago

I would prefer a more plain version of the implementation where setdiff is not involved. As i discussed with you today, set by its mathematical definition does not retain order. The actual implementation may be different from one language to another. But it is a behavior we should nto rely on --- it would make a statistician very uncomfortable to use sets in code assuming the order is retained. A simpler version is to keep the exact indices of which results are empty and remove those precise indices.

empty_indices <- which(sapply(results, function(x) length(x) == 0))
empty_elements <- c(empty_elements, empty_indices + length(fitted)) # Adjust the indices based on already added results

this part is in R, and empty_indices here is an index, setdiff in R used here would report the different index, it should be different from the set case we discussed today

rfeng2023 commented 6 months ago

replaced setdiff with another way, still without test. (Actually I tried to test it in jupyter notebook, but it complained a lot about

Warning message in optimize(function(par) fn(par, ...)/con$fnscale, lower = lower, :
“NA/Inf replaced by maximum positive value”

to crush my notebook. and I did not see it in the terminal before. And that should not be related to the modification in this PR...) but I think logically it should be correct, and please review it. Additionally, I try to run finemapping in my terminal today, and it complaining

Error in susie(X, Y, L = 1) : could not find function "susie"
gaow commented 6 months ago

Thanks @rfeng2023 what you have reads correct although i still don't know what i did wrong previously (try me tmr but not right now since my brain barely works at this moment).

Error in susie(X, Y, L = 1) : could not find function "susie"

It looks like we also need to add library(susieR) to the code because you might notice we have been using susie_wrapper from pecotmr , never susie() function directly until we put in the quick, L=1 test. I will merge after you add that line in.

rfeng2023 commented 6 months ago

I've modified the pipeline as we discussed, the regional data in this version would be

{'data': [("['/home/rf2872/Work/leaf_cutter2/ROSMAP_DLPFC_2024/Genotype/ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.bed']",
   'pheno_path1',
   'pheno_path2',
   'pheno_path3',
   'pheno_path4',
   'cov_path1',
   'cov_path2',
   'cov_path3',
   'cov_path4',],
 'meta_info': [('chr19:44905790-44909393',
   'chr19:41840000-47960000',
   'ENSG00000130203',
   'chr19_ENSG00000130203_P02649,(gp_1235_P02649,gp_1236_P02649),(gp_1235_P02649,gp_1236_P02649),ENSG00000130203',
   'DLPFC_Bennett_pQTL',
   'DLPFC_Klein_gpQTL_adjusted',
   'DLPFC_Klein_gpQTL_unadjusted',
   'Monocyte_ROSMAP_eQTL')]}
Bale4 commented 6 months ago

The execution of the job was halted after running for some time with this error Error: object 'empty_elements_indices' not found. I will check with a different region if that helps to investigate more