Closed rfeng2023 closed 6 months ago
Check out this pull request on
See visual diffs & provide feedback on Jupyter Notebooks.
Powered by ReviewNB
I am wondering why we not just paste each path and original ID one by one through join, so I defined one function merge_rows
@rfeng2023 this is what i did before but you removed it.
I think using agg
is more elegant, if it works? I wonder why it does not work.
For this line,
accumulated_pheno_df = accumulated_pheno_df.accumulated_pheno_df.groupby("ID").apply(merge_rows).reset_index()
I would feel much more comfortable if we group it by chr, start, end, ID
not just ID
. In fact, can there be cases where different isoforms of the same gene has different coordinates? If so, when we aggregate by the original ID, should we take the minimum of the start position and the maximum of the end? That will still result in two lines:
chr, start, end, ID, Original_ID
and aggregate the pathschr
and ID
but aggregate it such that the start
is the min()
of all the start
values, end
is the max()
of all end
values, then Original_ID
will be pasted with ,
. Does it make sense? If so could you give it a try and test?
I've added the code to detect the empty_indices without testing yet, please review it and please let @Bale4 test it.
I would prefer a more plain version of the implementation where setdiff
is not involved. As i discussed with you today, set
by its mathematical definition does not retain order. The actual implementation may be different from one language to another. But it is a behavior we should nto rely on --- it would make a statistician very uncomfortable to use sets in code assuming the order is retained. A simpler version is to keep the exact indices of which results are empty and remove those precise indices.
For this line,
accumulated_pheno_df = accumulated_pheno_df.accumulated_pheno_df.groupby("ID").apply(merge_rows).reset_index()
I would feel much more comfortable if we group it by
chr, start, end, ID
not justID
. In fact, can there be cases where different isoforms of the same gene has different coordinates? If so, when we aggregate by the original ID, should we take the minimum of the start position and the maximum of the end? That will still result in two lines:
- we group it by
chr, start, end, ID, Original_ID
and aggregate the paths- we group by
chr
andID
but aggregate it such that thestart
is themin()
of all thestart
values,end
is themax()
of allend
values, thenOriginal_ID
will be pasted with,
.Does it make sense? If so could you give it a try and test?
for1: accumulated_pheno_df.groupby(["#chr", "start", "end", "ID", "Original_ID"]).apply(merge_rows).reset_index(drop=True)
the output would be
Original_ID cond path cov_path
0 ENSG00000130203 Monocyte_ROSMAP_eQTL /home/al4225/xqtl_data/ROSMAP/DLPFC/monocyte/o... /home/al4225/xqtl_data/ROSMAP/DLPFC/monocyte/o...
1 chr19_ENSG00000130203_P02649 DLPFC_Bennett_pQTL /mnt/vast/hpc/homes/zq2209/work/input/ROSMAP_p... /mnt/vast/hpc/homes/zq2209/work/input/ROSMAP_p...
2 gp_1235_P02649,gp_1235_P02649 DLPFC_Klein_gpQTL_adjusted,DLPFC_Klein_gpQTL_u... /mnt/vast/hpc/homes/ss7112/QTL_Analysis/Glycop... /mnt/vast/hpc/homes/ss7112/QTL_Analysis/Glycop...
3 gp_1236_P02649,gp_1236_P02649 DLPFC_Klein_gpQTL_adjusted,DLPFC_Klein_gpQTL_u... /mnt/vast/hpc/homes/ss7112/QTL_Analysis/Glycop... /mnt/vast/hpc/homes/ss7112/QTL_Analysis/Glycop...
for In fact, can there be cases where different isoforms of the same gene has different coordinates?
if the coordinates here means region, then the answer is yes, the region of isoforms could be different, however, I think as we discussed before, we are still using the same gene region(extended cis window) here. I think maybe analysts are using different ref here, to cause the different start and end.
and them for 2 we group by chr and ID but aggregate it such that the start is the min() of all the start values, end is the max() of all end values, then Original_ID will be pasted with ,.
I will change to this way in next commit
I would prefer a more plain version of the implementation where
setdiff
is not involved. As i discussed with you today,set
by its mathematical definition does not retain order. The actual implementation may be different from one language to another. But it is a behavior we should nto rely on --- it would make a statistician very uncomfortable to use sets in code assuming the order is retained. A simpler version is to keep the exact indices of which results are empty and remove those precise indices.
empty_indices <- which(sapply(results, function(x) length(x) == 0))
empty_elements <- c(empty_elements, empty_indices + length(fitted)) # Adjust the indices based on already added results
this part is in R, and empty_indices here is an index, setdiff
in R used here would report the different index, it should be different from the set
case we discussed today
replaced setdiff
with another way, still without test. (Actually I tried to test it in jupyter notebook, but it complained a lot about
Warning message in optimize(function(par) fn(par, ...)/con$fnscale, lower = lower, :
“NA/Inf replaced by maximum positive value”
to crush my notebook. and I did not see it in the terminal before. And that should not be related to the modification in this PR...) but I think logically it should be correct, and please review it. Additionally, I try to run finemapping in my terminal today, and it complaining
Error in susie(X, Y, L = 1) : could not find function "susie"
Thanks @rfeng2023 what you have reads correct although i still don't know what i did wrong previously (try me tmr but not right now since my brain barely works at this moment).
Error in susie(X, Y, L = 1) : could not find function "susie"
It looks like we also need to add library(susieR)
to the code because you might notice we have been using susie_wrapper from pecotmr , never susie()
function directly until we put in the quick, L=1 test. I will merge after you add that line in.
I've modified the pipeline as we discussed, the regional data in this version would be
{'data': [("['/home/rf2872/Work/leaf_cutter2/ROSMAP_DLPFC_2024/Genotype/ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.bed']",
'pheno_path1',
'pheno_path2',
'pheno_path3',
'pheno_path4',
'cov_path1',
'cov_path2',
'cov_path3',
'cov_path4',],
'meta_info': [('chr19:44905790-44909393',
'chr19:41840000-47960000',
'ENSG00000130203',
'chr19_ENSG00000130203_P02649,(gp_1235_P02649,gp_1236_P02649),(gp_1235_P02649,gp_1236_P02649),ENSG00000130203',
'DLPFC_Bennett_pQTL',
'DLPFC_Klein_gpQTL_adjusted',
'DLPFC_Klein_gpQTL_unadjusted',
'Monocyte_ROSMAP_eQTL')]}
The execution of the job was halted after running for some time with this error Error: object 'empty_elements_indices' not found
. I will check with a different region if that helps to investigate more
@gaow
accumulated_pheno_df
is like below,0 chr19 ENSG00000130203 DLPFC_Bennett_pQTL /mnt/vast/hpc/homes/zq2209/work/input/ROSMAP_p... 44905791 /mnt/vast/hpc/homes/zq2209/work/input/ROSMAP_p... 44905790 chr19_ENSG00000130203_P02649 1 chr19 ENSG00000130203 DLPFC_Klein_gpQTL_adjusted,DLPFC_Klein_gpQTL_u... /mnt/vast/hpc/homes/ss7112/QTL_Analysis/Glycop... 44909393 /mnt/vast/hpc/homes/ss7112/QTL_Analysis/Glycop... 44905791 gp_1235_P02649,gp_1236_P02649 2 chr19 ENSG00000130203 Monocyte_ROSMAP_eQTL /home/al4225/xqtl_data/ROSMAP/DLPFC/monocyte/o... 44905791 /home/al4225/xqtl_data/ROSMAP/DLPFC/monocyte/o... 44905790 ENSG00000130203