TheJacksonLaboratory / cs-nf-pipelines

The Jackson Laboratory Computational Sciences Nextflow based analysis pipelines
MIT License

Creating gbrs_emissions_all_tissues.avecs.npz file #13

Closed lawtyunc closed 5 days ago

lawtyunc commented 5 days ago

Hello JAX lab,

I am trying to generate references to run GBRS, and I see that one of the inputs is the gbrs_emissions_all_tissues.avecs.npz file. I know you have this file for your references, but I am trying to figure out how I could regenerate it. Do you have an example of what the input meta.data.csv file should look like? I have been looking at the Python script but am a little confused about how this input file should be set up. I am having issues getting the script to run due to the -m metadata flag.

MikeWLloyd commented 5 days ago

We outlined the process here: https://github.com/TheJacksonLaboratory/cs-nf-pipelines/issues/11#issuecomment-2371287519. The metadata file is a csv with fields listed in the linked comment.

lawtyunc commented 5 days ago

With this, how does it work if you don't know which strain each sample belongs to? Would it be something like this, where you change the DO_id for every sample?

sampleID,tissue,strain,DO_id,sex,exclude
129_1_Adipose,Adipose,129S1_SvImJ,C,F,NO
129_1_Adipose,Adipose,129S1_SvImJ,D,F,NO
129_1_Adipose,Adipose,129S1_SvImJ,E,F,NO
129_1_Adipose,Adipose,129S1_SvImJ,F,F,NO

lawtyunc commented 5 days ago

Also, would the EMASE pipeline have to be run? I ran --workflow generate_pseudoreference and --workflow prepare_emase to get the inputs for GBRS. Which input file for GBRS would I need from the EMASE pipeline?

MikeWLloyd commented 5 days ago

This will not work if you do not know the strains of your samples, or if you do not have 'founder' RNA from the strains in your multi-parent samples.

To generate new reference materials for GBRS based on a different multi-parent population, you need 'founder' data taken from the inbred strains that founded the multi-parent population. For example, if you have a mixture of A/J, B6 and CAST in your multi-parent individuals, you need RNA data from inbred A/J, B6 and CAST strains to train the model.

In step 3, as outlined in https://github.com/TheJacksonLaboratory/cs-nf-pipelines/issues/11#issuecomment-2371287519, you run the inbred strain 'founder' data against the multiway transcriptome you generate in steps 1 and 2 of the same comment. The results of that analysis are used to train the emission probability of each genotype for each gene.

Yes, you change the DO ID to correspond to the arbitrary naming you assign during the generation of the multi-way transcriptome (e.g., A = A/J, B = B6, C = CAST, ...). You can use whatever lettering you wish for this, but it must be consistent across all steps of reference generation. It is perhaps more correct to call it a haplotype ID rather than a DO_ID.

It is possible we will need to adjust the script to generalize for non-DO / non-8-way samples.
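
As an illustration of the haplotype ID mapping described above, here is a minimal Python sketch that writes a metadata CSV with one row per founder sample. The column names are taken from the header row quoted earlier in this thread. The strain-to-letter mapping, the sample list, and the helper names (`build_metadata_rows`, `to_csv`) are hypothetical and only for illustration; this is not part of the pipeline, and the actual script may expect additional details.

```python
import csv
import io

# Hypothetical mapping of founder strain -> haplotype letter. The letters are
# arbitrary, but they must match the lettering used when the multiway
# transcriptome was generated.
HAPLOTYPE_IDS = {"A_J": "A", "B6": "B", "CAST": "C"}

# Column names as quoted in the header row earlier in this thread.
FIELDS = ["sampleID", "tissue", "strain", "DO_id", "sex", "exclude"]


def build_metadata_rows(samples, haplotype_ids):
    """Build one metadata row per founder sample of known strain."""
    rows = []
    for sample_id, tissue, strain, sex in samples:
        if strain not in haplotype_ids:
            # A sample of unknown strain cannot be assigned a haplotype ID.
            raise ValueError(f"No haplotype ID defined for strain {strain!r}")
        rows.append({
            "sampleID": sample_id,
            "tissue": tissue,
            "strain": strain,
            "DO_id": haplotype_ids[strain],
            "sex": sex,
            "exclude": "NO",
        })
    return rows


def to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical founder samples: (sampleID, tissue, strain, sex).
founder_samples = [
    ("A_J_Adipose", "Adipose", "A_J", "F"),
    ("B6_Adipose", "Adipose", "B6", "F"),
    ("CAST_Adipose", "Adipose", "CAST", "F"),
]

metadata_csv = to_csv(build_metadata_rows(founder_samples, HAPLOTYPE_IDS))
print(metadata_csv)
```

Keeping the mapping in a single dictionary is one way to make sure the same lettering is reused in every step of reference generation.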

lawtyunc commented 5 days ago

So each sample could have 8 rows, each one representing a different haplotype ID, like this?

129_1_Adipose,Adipose,129S1_SvImJ,A,F,NO
129_1_Adipose,Adipose,129S1_SvImJ,B,F,NO
129_1_Adipose,Adipose,129S1_SvImJ,C,F,NO
129_1_Adipose,Adipose,129S1_SvImJ,D,F,NO
129_1_Adipose,Adipose,129S1_SvImJ,E,F,NO
129_1_Adipose,Adipose,129S1_SvImJ,F,F,NO
129_1_Adipose,Adipose,129S1_SvImJ,G,F,NO
129_1_Adipose,Adipose,129S1_SvImJ,H,F,NO

129_2_Adipose,Adipose,129S1_SvImJ,A,F,NO
129_2_Adipose,Adipose,129S1_SvImJ,B,F,NO
129_2_Adipose,Adipose,129S1_SvImJ,C,F,NO

MikeWLloyd commented 5 days ago

A_J_Adipose,Adipose,A_J,A,F,NO
B6_Adipose,Adipose,B6,B,F,NO
CAST_Adipose,Adipose,CAST,C,F,NO
...

lawtyunc commented 5 days ago

Also, would the EMASE pipeline have to be run? I ran --workflow generate_pseudoreference and --workflow prepare_emase to get the inputs for GBRS. Which input file for GBRS would I need from the EMASE pipeline?

lawtyunc commented 5 days ago

Also, is there a workaround if you don't know the strain of every sample?

MikeWLloyd commented 5 days ago

All steps as outlined in https://github.com/TheJacksonLaboratory/cs-nf-pipelines/issues/11#issuecomment-2371287519 are required to generate new reference materials.

Please see the EMASE wiki for information on the inputs required for that workflow.

There is no workaround. You must have founder data of known strain for this method.
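
To catch the layout problems discussed in this thread before running the script, a metadata file could be sanity-checked with something like the sketch below. It assumes the six-column header quoted earlier in the thread and flags rows with no strain, repeated sample IDs, and strains or haplotype letters that are paired inconsistently. The function name `check_metadata` is hypothetical, and the real script may enforce different or additional rules.

```python
import csv
import io


def check_metadata(csv_text):
    """Return a list of human-readable problems found in the metadata CSV."""
    problems = []
    seen_samples = set()
    strain_to_hap = {}
    hap_to_strain = {}

    reader = csv.DictReader(io.StringIO(csv_text))
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        sample = (row.get("sampleID") or "").strip()
        strain = (row.get("strain") or "").strip()
        hap = (row.get("DO_id") or "").strip()

        if sample in seen_samples:
            problems.append(f"line {line_no}: sampleID {sample!r} appears more than once")
        seen_samples.add(sample)

        if not strain:
            problems.append(f"line {line_no}: sample {sample!r} has no strain")
            continue

        # Each strain should always carry the same haplotype letter ...
        previous_hap = strain_to_hap.setdefault(strain, hap)
        if previous_hap != hap:
            problems.append(
                f"line {line_no}: strain {strain!r} mapped to both {previous_hap!r} and {hap!r}"
            )
        # ... and each letter should always refer to the same strain.
        previous_strain = hap_to_strain.setdefault(hap, strain)
        if previous_strain != strain:
            problems.append(
                f"line {line_no}: haplotype ID {hap!r} used for both {previous_strain!r} and {strain!r}"
            )
    return problems


header = "sampleID,tissue,strain,DO_id,sex,exclude\n"

# One row per founder sample, as in the example rows above.
good = header + "A_J_Adipose,Adipose,A_J,A,F,NO\nB6_Adipose,Adipose,B6,B,F,NO\n"

# The same sample repeated under several haplotype IDs.
bad = header + (
    "129_1_Adipose,Adipose,129S1_SvImJ,A,F,NO\n"
    "129_1_Adipose,Adipose,129S1_SvImJ,B,F,NO\n"
)

good_problems = check_metadata(good)
bad_problems = check_metadata(bad)
for problem in bad_problems:
    print(problem)
```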