update codes related to sQTL

rfeng2023 commented 6 months ago

Hi @gaow please review code/cis_analysis/cis_workhorse.ipynb before you merge it.

review-notebook-app[bot] commented 6 months ago

Check out this pull request on

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

gaow commented 6 months ago

thanks @rfeng2023 am not sure if it is going to work because it looks like you deleted lots of existing working codes ... the desired output is:

 chr12   752578   752579  652578   852579  ENSG00000060237  Q9H4A3,P62873  protocol_example.protein_1.bed.gz,protocol_example.protein_2.bed.gz  covar_1.gz,covar_2.gz  trait_A,trait_B    chr12:752578-752579  protocol_example.genotype.chr21_22.bed

the Q9H4A3,P62873 is joined by ,. Are you sure your new code will achieve the same? The ID mapping file looks like:

Q9H4A3 ENSG00000060237
P62873 ENSG00000060237

like two splicing events on the first column and the gene they belong to on the 2nd column.

rfeng2023 commented 6 months ago

that's my existing meta with that new function for one gene the problem is I found that you pasted everything for output name with using such original ID ({_meta_info[3].replace(",","_")}), as you can see that would be super long?

gaow commented 6 months ago

that's my existing meta with that new function for one gene

Really? From your code I don't see how you could even make it happen with so few lines of code ... For example I don't see how you were able to consolidate them with , separated.

as you can see that would be super long?

You are right. Maybe we want to just use the gene ID (meta_info[2]) as the output file name?

rfeng2023 commented 6 months ago

consolidate them with , separated is by

combined_df = accumulated_pheno_df.groupby(accumulated_pheno_df.columns.difference(['Original_ID']).tolist(), as_index=False).agg({'Original_ID': ','.join})

and have fixed output name by gene ID in new commit

rfeng2023 commented 6 months ago

I got below error with running

sos run ~/codes/xqtl-pipeline/pipeline/cis_workhorse.ipynb susie_twas   \
   --name ROSMAP_eQTL  \
   --genoFile /mnt/vast/hpc/csg/FunGen_xQTL/ROSMAP/Genotype/geno_by_chrom/ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.11.bed    \
   --phenoFile /mnt/vast/hpc/csg/wanggroup/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/pseudo_bulk_eqtl_kelli/Ast/output/data_preprocessing/phenotype_data/snuc_pseudo_bulk.Ast.normalized.log2cpm.region_list.txt  \
   /mnt/vast/hpc/csg/wanggroup/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/pseudo_bulk_eqtl_kelli/Exc/output/data_preprocessing/phenotype_data/snuc_pseudo_bulk.Exc.normalized.log2cpm.region_list.txt  \
   --covFile  /mnt/vast/hpc/csg/wanggroup/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/pseudo_bulk_eqtl_kelli/Ast/output/data_preprocessing/covariate_data/snuc_pseudo_bulk.Ast.normalized.log2cpm.rosmap_cov.ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz \
   /mnt/vast/hpc/csg/wanggroup/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/pseudo_bulk_eqtl_kelli/Exc/output/data_preprocessing/covariate_data/snuc_pseudo_bulk.Exc.normalized.log2cpm.rosmap_cov.ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz \
   --customized-cis-windows ~/codes/fungen-xqtl-analysis/resource/TADB_enhanced_cis.coding.bed   \
   --phenotype-names  Ast Exc \
   --no-fine-mapping --no-twas-weights \
   --cwd ROSMAP_eQTL_data_extract \
   --region-name ENSG00000073921

to extract data.. not sure what does that mean

ERROR: [susie_twas_1]: [susie_twas_1]: Failed to process step output (f'{cwd:a}/{step_name[:-2]}/{name}.{_meta_info[2].replace(",","_")}.univariate{"_susie" if fine_mapping else ""}{"_twas_weights" if twas_weights else ""}.rds'): Output /mnt/vast/hpc/homes/rf2872/test/ROSMAP_eQTL_data_extract/susie_twas/ROSMAP_eQTL. ENSG00000073921.univariate.rds from substep 1 of 2 substeps overlaps with output from a previous substep.

gaow commented 6 months ago

@rfeng2023 its great that you seem to be able to code it up so much more concisely than what Zoey did the other day!

The error you run into means your output file names have duplicates. Since you use the ID column directly now into filename, it means in your combined_df table the ID column has duplicates from that table. We checked and made sure that the original ID does not duplicate. However we did not check about duplicates in the ID column -- these columns should have different start positions.

If you use this logic:

{"_".,join([chrom, start_pos,  end_pos,  _meta_info[2]]) if _meta_info[2] != _meta_info[3] else _meta_info[2]}

that would "fix" the issue. The problem with the above proposal is that for each gene you will save different events to different places. But at least you can implement that for your toy data and see.

To keep events within the same gene into the same file, the other idea that may work is this --- you turn the data below:

chr1 12315 12316  ENSG000001 isoform1,isoform2
chr1 12340 12341  ENSG000001 isoform3

Into

chr1 12315 12341  ENSG000001 isoform1,isoform2,isoform3

That is, to further combine rows by the ID column using min(start) and max(end). This will give you unique ID column.

Then in your data extraction code you will extract and get this range of values into the output of tabix_region which may also contain other genes but it is okay because in the tabix_region function you will also further filter it by the Original ID column i believe, so you only subset the lines including isoform1, isoform2 and isoform3.

I hope what I explained makes sense to you.

rfeng2023 commented 6 months ago

Thanks for reminding the duplicates, I think the problem is that I did not aggregate conditions etc. in function and leave duplicate ID in the regional data. I've tested with sQTL and eQTL data and that works.

cumc / xqtl-protocol

update codes related to sQTL #870