General Questions - Githubissues

cutleraging commented 10 months ago

Hi Burçak,

Can you please help me answer some questions about your great program...

What do these parameters do?

epigenomics_dna_elements
mutation_annotation_integration
sample_based
lncRNA

What is the difference between providing probabilities and running in 'aggregated' mode? For my data, I get no results when providing the probabilities, but I do when running in 'aggregated' mode. From the intermediate files, it looks like 'aggregated' mode still calculates the probabilities?

My samples are IMR90 cells, is it correct that the library you have for IMR90 data is only for replication_time and replication_time_strand_bias?

I am interested in calculating observed / expected ratios using genome annotations such as ChromHMM. How can I do this with your program?

I see for the Epigenomics Occupancy analysis that an observed and simulated signal are being compared. Can you explain what this is? Is this the amount of mutations? Or does it have something to do with the signal from the epigenetic assay?

Thanks a lot! Ronnie

burcakotlu commented 10 months ago

Dear Ronnie,

Thanks for your questions.

epigenomics_dna_elements are the short names of the DNA elements of interest. Each epigenomics DNA element must be contained in the name of at least one epigenomics file.

For example, when the epigenomics file name is ENCFF045ZYD_upper-lobe-of-left-lung_H3K4me3-human.bed, the DNA element H3K4me3 will use this file's results, since H3K4me3 is contained in the filename. This allows multiple epigenomics files to be considered for H3K4me3.
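
As a rough illustration of the filename rule above, here is a minimal Python sketch that groups files by the DNA element names their filenames contain. It is not SPT's actual code: the function name `group_files_by_element` and the second file name are hypothetical, and the sketch assumes a plain, case-sensitive substring match, which the thread does not confirm.

```python
from collections import defaultdict


def group_files_by_element(dna_elements, file_names):
    """Map each DNA element to the files whose names contain it.

    Assumption: a plain, case-sensitive substring match.
    """
    groups = defaultdict(list)
    for element in dna_elements:
        for name in file_names:
            if element in name:
                groups[element].append(name)
    return dict(groups)


file_names = [
    "ENCFF045ZYD_upper-lobe-of-left-lung_H3K4me3-human.bed",  # from the thread
    "example_lung_H3K27ac-human.bed",  # hypothetical file name
]
groups = group_files_by_element(["H3K4me3", "H3K27ac", "CTCF"], file_names)

# Elements with no matching file (CTCF here) are simply absent from the result.
for element, files in groups.items():
    print(element, files)
```

Under this assumed rule, an element that matches no filename would have no library file to analyze, which is why each element needs to appear in at least one file name.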

mutation_annotation_integration, sample_based, and lncRNA are SPT parameters that are not maintained in the current version of SPT. For example, sample_based results in a huge number of figures, because topography figures are provided separately for each sample. I have hidden these parameters for the sake of simplicity.
Aggregated mode considers all mutations, whereas signature-based analysis considers the mutations assigned to the signature of interest. In aggregate mode, the probabilities are not taken into account.
You can download histone modifications and transcription factor binding sites files from ENCODE (in bed format) and run the SPT.
Could you please provide information about the genome annotations of ChromHMM?
I'm hoping that our manuscript will be available on bioRxiv so that all these analyses will be clearer.
You can see the updated SPT version 1.0.81.
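
To make the aggregated versus signature-based distinction above concrete, here is a small Python sketch. It is an illustration only, not SPT's implementation: the function `select_mutations`, the data layout, the toy probabilities, and the `cutoff` value are all assumptions, and the thread does not state how SPT turns probabilities into assignments.

```python
def select_mutations(mutations, signature=None, cutoff=0.5):
    """Return the ids of the mutations that enter an analysis.

    signature=None -> aggregated: every mutation, probabilities ignored.
    signature="X"  -> only mutations whose probability for X reaches
                      the (hypothetical) cutoff.
    """
    if signature is None:
        return [m["id"] for m in mutations]
    return [
        m["id"]
        for m in mutations
        if m["probabilities"].get(signature, 0.0) >= cutoff
    ]


# Toy data: three mutations with made-up signature probabilities.
mutations = [
    {"id": "m1", "probabilities": {"SBS1": 0.9, "SBS5": 0.1}},
    {"id": "m2", "probabilities": {"SBS1": 0.2, "SBS5": 0.8}},
    {"id": "m3", "probabilities": {"SBS1": 0.6, "SBS5": 0.4}},
]

aggregated = select_mutations(mutations)
sbs1_only = select_mutations(mutations, signature="SBS1")
sbs5_only = select_mutations(mutations, signature="SBS5")
print(aggregated, sbs1_only, sbs5_only)
```

In this sketch the aggregated set never shrinks, while a signature-specific set can become small or empty depending on the probabilities and the cutoff.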

Best wishes, Burcak

burcakotlu commented 10 months ago

I'm closing this issue. If you have any further questions, please let me know.

cutleraging commented 10 months ago

Hi Burcak,

Thanks for the responses!

1. Can you explain a bit more about aggregated mode? What do you mean by "the probabilities are not taken into account"? How will this affect the results compared to running with the probabilities? I ask because I still see signature-specific results in the output of aggregated mode.
2. To clarify the chromHMM question: as mentioned, I am interested in calculating observed / expected ratios using this genome annotation. This is what it looks like...

chr1    841999  842400
chr1    845599  846000
chr1    858599  858800
chr1    876399  877000
chr1    901199  901600
chr1    937399  937600
chr1    940399  941200
chr1    949799  950200

You can see that it is just genomic ranges, with no signal associated with them. So I am wondering whether the signal column is required for your program, or whether it is possible to calculate observed / expected ratios given only genomic ranges?

Thanks, Ronnie

cutleraging commented 10 months ago

Hi @burcakotlu, wondering if you saw my last comment here. It seems to have closed prematurely.

Best, Ronnie

burcakotlu commented 10 months ago

Dear Ronnie,

1. In aggregated mode, all mutations are considered in the topography analyses without assigning mutations to each specific mutational signature. In signature-specific mode, mutations are considered for the signatures to which they are assigned through the probabilities. SPT provides resulting figures both for aggregated mode (considering all mutations) and for each specific mutational signature, as long as this is possible.
2. Our occupancy analyses are designed for library files that have signal columns (e.g., ENCODE ChIP-seq narrow peak bed files). Unfortunately, they won't work for chromHMM files that contain genomic ranges only. As a workaround, you can add a 4th column (1-based) containing a default signal value of 1.
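
The workaround of adding a constant signal column can be sketched in Python as follows. This is a minimal example under stated assumptions: the helper `add_constant_signal` is hypothetical, the input is assumed to be whitespace-separated chrom/start/end lines like the ones posted above, and the output is written tab-separated. How SPT is told which column holds the signal is not covered in this thread, so check the SPT documentation for that.

```python
def add_constant_signal(lines, signal=1):
    """Append a constant signal value as a 4th column to 3-column BED lines.

    Blank lines, comments, and lines with fewer than 3 fields are skipped.
    """
    out = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or fields[0].startswith("#"):
            continue
        chrom, start, end = fields[:3]
        out.append(f"{chrom}\t{start}\t{end}\t{signal}")
    return out


# First two ranges from the chromHMM excerpt posted in this thread.
bed_lines = [
    "chr1    841999  842400",
    "chr1    845599  846000",
]
with_signal = add_constant_signal(bed_lines)
print("\n".join(with_signal))
```

To convert a whole file, the same function can be applied to the lines of an opened input file and the result written out with `"\n".join(...)`.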

I couldn't see this message on GitHub before, but I got an email and replied to that email on Dec 31, 2023. Now, that email has also shown up on GitHub. So, I copied my former answer here.

If you have any questions, please let me know.

Best wishes, Happy New Year! Burcak Otlu

cutleraging commented 10 months ago

Hi @burcakotlu,

Thanks for the reply! Regarding 2, if I just put 1 for the signal, how should I interpret the output files? Does it make sense to do this? As I've mentioned, I am interested in calculating observed / expected ratios.

Thanks, Ronnie

burcakotlu commented 10 months ago

Dear Ronnie,

Since we don't know the signal values, I suggest providing a signal value of 1. This way, you can only compare whether your mutations fall into these regions preferentially compared to the simulated mutations. However, this is a suboptimal way of using these files, as we cannot capture signal differences among the regions in these files.
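
One way to picture the comparison described above is a simple count-based observed / expected ratio: the number of real mutations inside the regions, divided by the mean number of simulated mutations inside the same regions. The Python sketch below is an assumption-laden illustration, not SPT's occupancy calculation: the function names are hypothetical, it handles a single chromosome, it treats regions as non-overlapping half-open [start, end) intervals, and all mutation positions are made up.

```python
from bisect import bisect_right


def count_in_regions(positions, regions):
    """Count positions that fall inside non-overlapping [start, end) regions."""
    regions = sorted(regions)
    starts = [start for start, _ in regions]
    count = 0
    for pos in positions:
        i = bisect_right(starts, pos) - 1  # last region starting at or before pos
        if i >= 0 and pos < regions[i][1]:
            count += 1
    return count


def observed_over_expected(real, simulations, regions):
    """Ratio of the real count to the mean count over the simulations."""
    observed = count_in_regions(real, regions)
    sim_counts = [count_in_regions(sim, regions) for sim in simulations]
    expected = sum(sim_counts) / len(sim_counts)
    return observed / expected if expected > 0 else float("nan")


# Regions from the chromHMM excerpt in this thread (chr1 only).
regions = [(841999, 842400), (845599, 846000)]
# Made-up mutation positions, for illustration only.
real = [842000, 842100, 845700, 900000]
simulations = [
    [842050, 850000, 860000, 870000],
    [845600, 845999, 880000, 890000],
]
ratio = observed_over_expected(real, simulations, regions)
print(ratio)
```

A ratio above 1 in this toy setting means the real mutations fall in the regions more often than the simulated ones; with a constant signal of 1, no difference between regions can be expressed.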

Best, Burcak

burcakotlu commented 10 months ago

Dear Ronnie,

I will close this issue.
If you have any questions, please feel free to ask.

Best wishes, Burcak

AlexandrovLab / SigProfilerTopography

General Questions #5