Closed rfeng2023 closed 6 months ago
Thanks @rfeng2023. I am not sure this is going to work, because it looks like you deleted a lot of existing working code ... the desired output is:
chr12 752578 752579 652578 852579 ENSG00000060237 Q9H4A3,P62873 protocol_example.protein_1.bed.gz,protocol_example.protein_2.bed.gz covar_1.gz,covar_2.gz trait_A,trait_B chr12:752578-752579 protocol_example.genotype.chr21_22.bed
The Q9H4A3,P62873 field is joined by a comma (,). Are you sure your new code will achieve the same? The ID mapping file looks like:
Q9H4A3 ENSG00000060237
P62873 ENSG00000060237
That is, like two splicing events in the first column and the gene they belong to in the second column.
that's my existing meta with that new function for one gene
The problem I found is that you build the output name by pasting together all of the original IDs ({_meta_info[3].replace(",","_")}); as you can see, that would be super long?
> that's my existing meta with that new function for one gene
Really? From your code I don't see how you could even make it happen with so few lines of code ... For example, I don't see how you were able to consolidate them into a comma-separated list.
> as you can see that would be super long?
You are right. Maybe we want to just use the gene ID (meta_info[2]) as the output file name?
Consolidating them into a comma-separated list is done by:
combined_df = accumulated_pheno_df.groupby(accumulated_pheno_df.columns.difference(['Original_ID']).tolist(), as_index=False).agg({'Original_ID': ','.join})
I have also fixed the output name to use the gene ID in a new commit.
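To make the groupby call above easier to check, here is a minimal sketch on a toy table. The column names other than Original_ID, and the toy values, are assumptions for illustration and are not taken from the notebook.

```python
import pandas as pd

# Toy stand-in for accumulated_pheno_df; only Original_ID is a name from the thread,
# the other column names are assumed for illustration.
accumulated_pheno_df = pd.DataFrame({
    "#chr": ["chr12", "chr12"],
    "start": [752578, 752578],
    "end": [752579, 752579],
    "ID": ["ENSG00000060237", "ENSG00000060237"],
    "Original_ID": ["Q9H4A3", "P62873"],
})

# Group on every column except Original_ID, then join the original IDs with commas.
group_cols = accumulated_pheno_df.columns.difference(["Original_ID"]).tolist()
combined_df = (
    accumulated_pheno_df
    .groupby(group_cols, as_index=False)
    .agg({"Original_ID": ",".join})
)

# Index.difference returns the column names sorted, so restore the original order.
combined_df = combined_df[accumulated_pheno_df.columns]
print(combined_df)
```

Rows only collapse when every other column is identical, so two events of the same gene with different start positions stay as separate rows with the same ID.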
I got the error below when running
sos run ~/codes/xqtl-pipeline/pipeline/cis_workhorse.ipynb susie_twas \
--name ROSMAP_eQTL \
--genoFile /mnt/vast/hpc/csg/FunGen_xQTL/ROSMAP/Genotype/geno_by_chrom/ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.11.bed \
--phenoFile /mnt/vast/hpc/csg/wanggroup/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/pseudo_bulk_eqtl_kelli/Ast/output/data_preprocessing/phenotype_data/snuc_pseudo_bulk.Ast.normalized.log2cpm.region_list.txt \
/mnt/vast/hpc/csg/wanggroup/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/pseudo_bulk_eqtl_kelli/Exc/output/data_preprocessing/phenotype_data/snuc_pseudo_bulk.Exc.normalized.log2cpm.region_list.txt \
--covFile /mnt/vast/hpc/csg/wanggroup/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/pseudo_bulk_eqtl_kelli/Ast/output/data_preprocessing/covariate_data/snuc_pseudo_bulk.Ast.normalized.log2cpm.rosmap_cov.ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz \
/mnt/vast/hpc/csg/wanggroup/fungen-xqtl-analysis/analysis/Wang_Columbia/ROSMAP/pseudo_bulk_eqtl_kelli/Exc/output/data_preprocessing/covariate_data/snuc_pseudo_bulk.Exc.normalized.log2cpm.rosmap_cov.ROSMAP_NIA_WGS.leftnorm.bcftools_qc.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz \
--customized-cis-windows ~/codes/fungen-xqtl-analysis/resource/TADB_enhanced_cis.coding.bed \
--phenotype-names Ast Exc \
--no-fine-mapping --no-twas-weights \
--cwd ROSMAP_eQTL_data_extract \
--region-name ENSG00000073921
to extract data. I'm not sure what it means:
ERROR: [susie_twas_1]: [susie_twas_1]: Failed to process step output (f'{cwd:a}/{step_name[:-2]}/{name}.{_meta_info[2].replace(",","_")}.univariate{"_susie" if fine_mapping else ""}{"_twas_weights" if twas_weights else ""}.rds'): Output /mnt/vast/hpc/homes/rf2872/test/ROSMAP_eQTL_data_extract/susie_twas/ROSMAP_eQTL. ENSG00000073921.univariate.rds from substep 1 of 2 substeps overlaps with output from a previous substep.
@rfeng2023 it's great that you seem to be able to code it up so much more concisely than what Zoey did the other day!
The error you ran into means your output file names have duplicates. Since you now put the ID column directly into the filename, it means the ID column in your combined_df table has duplicates. We checked and made sure that the original IDs are not duplicated. However, we did not check for duplicates in the ID column; rows that share an ID should have different start positions.
If you use this logic:
{"_".join([chrom, start_pos, end_pos, _meta_info[2]]) if _meta_info[2] != _meta_info[3] else _meta_info[2]}
that would "fix" the issue. The problem with the above proposal is that for each gene you will save different events to different places. But at least you can implement that for your toy data and see.
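As a rough sketch of the naming rule above, outside of the SoS step: the helper name output_tag and the toy rows are hypothetical, and the values are cast to str in case positions are stored as integers.

```python
from collections import Counter

def output_tag(chrom, start_pos, end_pos, gene_id, original_id):
    """Hypothetical helper mirroring the proposed expression.

    gene_id plays the role of _meta_info[2] and original_id that of _meta_info[3].
    """
    if gene_id != original_id:
        # Prefix with coordinates so two events of one gene get distinct names.
        return "_".join(str(x) for x in [chrom, start_pos, end_pos, gene_id])
    return gene_id

# Toy rows: (chrom, start, end, gene ID, original IDs)
rows = [
    ("chr1", 12315, 12316, "ENSG000001", "isoform1,isoform2"),
    ("chr1", 12340, 12341, "ENSG000001", "isoform3"),
]

# Using the gene ID alone gives duplicate names, which triggers the overlap error.
plain_counts = Counter(r[3] for r in rows)
tags = [output_tag(*r) for r in rows]
print(plain_counts, tags)
```

With the coordinate prefix the two events no longer collide, at the cost of writing one file per event rather than one per gene.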
To keep events within the same gene in the same file, another idea that may work is to turn the data below:
chr1 12315 12316 ENSG000001 isoform1,isoform2
chr1 12340 12341 ENSG000001 isoform3
into
chr1 12315 12341 ENSG000001 isoform1,isoform2,isoform3
That is, further combine rows by the ID column using min(start) and max(end). This will give you a unique ID column.
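The row merging described above could be sketched in pandas as follows. The column names are assumed for illustration; the toy rows are the ones from the example.

```python
import pandas as pd

# The two example rows; column names are assumed for illustration.
regions = pd.DataFrame(
    [
        ["chr1", 12315, 12316, "ENSG000001", "isoform1,isoform2"],
        ["chr1", 12340, 12341, "ENSG000001", "isoform3"],
    ],
    columns=["chr", "start", "end", "ID", "Original_ID"],
)

# One row per (chr, ID): smallest start, largest end, all original IDs joined by commas.
merged = (
    regions
    .groupby(["chr", "ID"], as_index=False, sort=False)
    .agg(
        start=("start", "min"),
        end=("end", "max"),
        Original_ID=("Original_ID", ",".join),
    )
)

# Put the columns back into the original order.
merged = merged[["chr", "start", "end", "ID", "Original_ID"]]
print(merged)
```

After this step the ID column is unique, so a file name built from the gene ID alone no longer collides across substeps.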
Then, in your data extraction code, you will extract this range of values into the output of tabix_region. It may also contain other genes, but that is okay because I believe the tabix_region function further filters by the Original ID column, so you only keep the lines for isoform1, isoform2 and isoform3.
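A minimal sketch of that second filtering step, assuming the extracted range is available as a DataFrame with one row per event. The helper filter_by_original_id, the column names and the toy rows are all hypothetical; this is not the actual tabix_region code.

```python
import pandas as pd

def filter_by_original_id(extracted, original_ids, id_col="ID"):
    """Hypothetical helper: keep only rows whose ID is in the comma-separated list."""
    wanted = set(original_ids.split(","))
    return extracted[extracted[id_col].isin(wanted)].reset_index(drop=True)

# Toy result of extracting chr1:12315-12341, including an event from another gene.
extracted = pd.DataFrame({
    "chr": ["chr1"] * 4,
    "start": [12315, 12320, 12330, 12340],
    "end": [12316, 12321, 12331, 12341],
    "ID": ["isoform1", "isoform2", "other_gene_event", "isoform3"],
})

kept = filter_by_original_id(extracted, "isoform1,isoform2,isoform3")
print(kept)
```

The wider min(start) to max(end) window can pull in unrelated events, and the filter on the original IDs is what removes them again.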
I hope what I explained makes sense to you.
Thanks for the reminder about the duplicates. I think the problem is that I did not aggregate conditions etc. in the function, which left duplicate IDs in the regional data. I've tested with sQTL and eQTL data and that works.
Hi @gaow please review code/cis_analysis/cis_workhorse.ipynb before you merge it.