daisybio / ASimulatoR

ASimulatoR: splice-aware RNA-seq data simulation https://doi.org/10.1093/bioinformatics/btab142
GNU Lesser General Public License v3.0

More documentation on output files #4

Open KuechlerO opened 1 year ago

KuechlerO commented 1 year ago

Hey guys, thx for the great tool!

I was wondering whether I have missed something, or whether there is simply no further documentation available on the output files.

My specific questions:

  1. event_annotation.tsv: What exactly are the columns (e.g. what do the columns genomic_start, and genomic_end display; why are not only splice-variants, but also the templates listed in this file?)
  2. sim_tx_info.txt: What is the difference between foldchange.V1, and foldchange.c2? (What is V1, and what is c2?)

quirinmanz commented 1 year ago

Hi,

  1. event_annotation.tsv: What exactly are the columns (e.g. what do the columns genomic_start, and genomic_end display; why are not only splice-variants, but also the templates listed in this file?)

The columns genomic_start and genomic_end document the location, i.e., the start and end of the splicing event, on a genomic level. transcriptomic_start and transcriptomic_end document this on the transcriptomic level. The templates are given such that there is a clear reference for each event. When there are multiple splice variants for the same reference, other events could arise between these two splice variants.

  2. sim_tx_info.txt: What is the difference between foldchange.V1, and foldchange.c2? (What is V1, and what is c2?)

The columns in sim_tx_info.txt correspond to the groups. I agree that their naming is confusing. This is because they only have indices internally. I will try to fix this.
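Until the naming is fixed, the fold-change columns can be relabelled by position. A minimal sketch, assuming the fold-change columns appear in group order; it uses a toy table in place of the real sim_tx_info.txt, and the transcript IDs, values, and new labels are hypothetical:

```r
# Toy stand-in for sim_tx_info.txt (hypothetical IDs and values);
# a real file would be loaded with read.delim() instead.
tx_info <- data.frame(
  transcriptid  = c("tx1", "tx2", "tx3"),
  foldchange.V1 = c(1, 2, 1),
  foldchange.c2 = c(1, 1, 4)
)

# Locate the fold-change columns and rename them by position
# (assumption: column order matches group order).
fc_cols <- grep("^foldchange", names(tx_info))
names(tx_info)[fc_cols] <- paste0("foldchange.group", seq_along(fc_cols))

print(names(tx_info))
```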

Does this help? Best, Quirin

KuechlerO commented 1 year ago

Cool, yes this helps. Thx for the quick reply! :)

Another point: It is not obvious to me, which group is control and which one is not. Or in general, how the groups are created.

In your example, you write:

# define, how many groups and samples per group you analyze. Here we create a small experiment with two groups with one sample per group:
num_reps = c(1,1)

Ok, so 2 groups means that one is control, and the other one is the variant group? Is the first group the control group? What happens if I choose >2 groups?

quirinmanz commented 1 year ago

Currently, there is no clear distinction between the groups. This tool only supports the functionality provided by polyester. The fold changes documented in sim_tx_info.txt are introduced randomly. In principle, the groups do not care about the variants, as polyester just simulates from the transcripts given by the ASimulatoR.

KuechlerO commented 1 year ago

Mhm, ok. So the splitting in groups is just introduced for downstream fold change simulations with polyester?!

One more question: Could you also please explain the exact effect of event_probs? My understanding: The event-frequency gives the frequency for the specific variant to appear in the given gene (in each sample?!). But then, why is the sum of the event-frequencies restricted to sum(event_freq) == 1?

My actual goal is:

  1. Simulate RNAseq reads for first group without any variants --> Have it as controls
  2. Simulate RNAseq reads for second group with variants --> Have this as patient cohort

So in the second group I would like to set for specific genes to have splice variants. This should appear at specific frequencies: E.g. 100%, so all reads are following a specific splice pattern --> E.g. homozygous variant that destroys a splice site.

--> As far as I have understood, this could right now only be achieved by starting a separate run for each splice variant and setting the event_freq=1. Am I right?

Thx for your help! :)

quirinmanz commented 1 year ago

Mhm, ok. So the splitting in groups is just introduced for downstream fold change simulations with polyester?!

Yes, the groups are a parameter for polyester.

One more question: Could you also please explain the exact effect of event_probs? My understanding: The event-frequency gives the frequency for the specific variant to appear in the given gene (in each sample?!). But then, why is the sum of the event-frequencies restricted to sum(event_freq) == 1?

From the README:

Probability: For each superset we create an event with the probability mentioned in event_prob.
Frequency: Set probs_as_freq = T. The exon supersets are partitioned corresponding to the event_prob parameter.

and

Named list/vector containing numerics corresponding to the probabilities to create the event (combination). If probs_as_freq is TRUE, event_probs correspond to the relative frequency of occurrences for the event (combination), and in this case the sum of all frequencies has to be <=1.
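The difference between the two modes can be illustrated with plain R. This is only a sketch of the behaviour described in the README excerpts, not ASimulatoR's actual implementation; the event names, values, and number of supersets are made up:

```r
set.seed(1)
n_supersets <- 1000
event_probs <- c(es = 0.2, ir = 0.1)   # hypothetical event types and values

# Probability mode (sketch): every superset is drawn independently for each
# event, so the realized counts fluctuate around p * n_supersets.
as_prob <- sapply(event_probs, function(p) sum(runif(n_supersets) < p))

# Frequency mode (sketch): the supersets are partitioned, so each event gets a
# fixed share and no superset is assigned twice. Shares therefore cannot
# add up to more than 1.
stopifnot(sum(event_probs) <= 1)
sizes    <- round(event_probs * n_supersets)
shuffled <- sample(n_supersets)
as_freq  <- split(shuffled[seq_len(sum(sizes))], rep(names(sizes), sizes))

print(as_prob)          # varies with the seed
print(lengths(as_freq)) # fixed by event_probs
```

In the frequency sketch, any supersets left over after the partition receive no event, which is why a sum below 1 is allowed.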

My actual goal is:

  1. Simulate RNAseq reads for first group without any variants --> Have it as controls
  2. Simulate RNAseq reads for second group with variants --> Have this as patient cohort

So in the second group I would like to set for specific genes to have splice variants. This should appear at specific frequencies: E.g. 100%, so all reads are following a specific splice pattern --> E.g. homozygous variant that destroys a splice site.

ASimulatoR was created to benchmark event detection tools. If I understand correctly, you are analyzing differential splicing and isoform switching.

--> As far as I have understood, this could right now only be achieved by starting a separate run for each splice variant and setting the event_freq=1.

You could still create GTF files with splice events using ASimulatoR and then give this custom GTF to polyester with your own fold_change table. This is not recommended, but should work. An example is attached. I added .txt because GitHub doesn't allow attaching R scripts.
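The attached script is not reproduced here. As a rough sketch of the idea, assuming polyester's simulate_experiment() accepts a fold_changes matrix with one row per transcript and one column per group, a table that leaves the first group unchanged as a control could be built as follows. The transcript count, the affected rows, the fold-change value, and the paths in the commented call are all hypothetical:

```r
n_tx     <- 6          # hypothetical number of transcripts in the custom GTF
num_reps <- c(1, 1)    # two groups with one sample each

# Start with a fold change of 1 everywhere; column 1 acts as the control group.
fold_changes <- matrix(1, nrow = n_tx, ncol = length(num_reps))

# Raise the expression of selected transcripts in group 2 only.
variant_rows <- c(2, 5)   # hypothetical rows holding the splice variants
fold_changes[variant_rows, 2] <- 4

print(fold_changes)

# The matrix would then be handed to polyester together with the GTF
# (placeholder paths, not run here):
# polyester::simulate_experiment(gtf = "custom.gtf", seqpath = "genome/",
#                                num_reps = num_reps,
#                                fold_changes = fold_changes,
#                                outdir = "out/")
```

The row order of the matrix has to match the order in which polyester reads the transcripts, so it is worth checking against the resulting sim_tx_info.txt.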

test_script.R.txt