broadinstitute / longbow

Annotation and segmentation of MAS-seq data
https://broadinstitute.github.io/longbow/
BSD 3-Clause "New" or "Revised" License

documentation on how to create new models #170

Open krobison13 opened 1 year ago

krobison13 commented 1 year ago

There will certainly be interest in creating additional models, e.g. with fewer segments to accommodate larger inserts, or with other schemes of information embedded in each segment. A tutorial on how to train new models would be very useful.

jonn-smith commented 1 year ago

@krobison13 This is an excellent point. We're in the process of refactoring how models are defined, at which point we'll need a new set of instructions anyway.

In the meantime, the easiest way to do this is to start with an existing model and modify it.

You can list the models available using:

$ longbow model -l
[INFO 2022-10-07 14:50:25    model] Invoked via: longbow model -l
Longbow includes the following models:
Name                                    Version  Description
10x_sc_10x5p_single_none                1.0.0    Model for a single cDNA sequence from the 10x 5' kit
mas_15_sc_10x5p_single_none             2.0.1    The standard MAS-seq 15 array element model.
mas_15_sc_10x3p_single_none             2.0.2    The 3' kit MAS-seq 15 array element model.
mas_15_bulk_10x5p_single_internal       1.0.1    A MAS-seq 15 array element model with a 10 base index just before the 3' adapter for bulk sequencing.
mas_10_sc_10x5p_single_none             2.0.1    The MAS-seq 10 array element model.
mas_15_spatial_slide-seq_single_none    2.0.2    The Slide-seq 15 array element model.
mas_15_bulk_teloprimeV2_single_none     2.0.1    The MAS15 Teloprime V2 indexed array element model.
isoseq_1_sc_10x5p_single_none           1.0.1    Single-cell RNA (without MAS-seq prep).

Then dump one of them to disk:

$ longbow model -d mas_15_sc_10x5p_single_none
[INFO 2022-10-07 14:51:07    model] Invoked via: longbow model -d mas_15_sc_10x5p_single_none
[INFO 2022-10-07 14:51:08    model] Dumping mas_15_sc_10x5p_single_none: The standard MAS-seq 15 array element model.
[INFO 2022-10-07 14:51:08    model] Dumping dotfile: longbow_model_mas_15_sc_10x5p_single_none.v2.0.1.dot
[INFO 2022-10-07 14:51:08    model] Dumping simple dotfile: longbow_model_mas_15_sc_10x5p_single_none.v2.0.1.simple.dot
[INFO 2022-10-07 14:51:08    model] Dumping json model specification: longbow_model_mas_15_sc_10x5p_single_none.v2.0.1.spec.json
[INFO 2022-10-07 14:51:08    model] Dumping dense transition matrix: longbow_model_mas_15_sc_10x5p_single_none.v2.0.1.dense_transition_matrix.pickle
[INFO 2022-10-07 14:51:08    model] Dumping emission distributions: longbow_model_mas_15_sc_10x5p_single_none.v2.0.1.emission_distributions.txt
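
If you script this step, it can help to know which files to expect. The following is a minimal Python sketch that rebuilds the five file names from a model name and version. The naming pattern is inferred only from the log output above, so treat it as an assumption that may not hold for other Longbow versions; `dumped_file_names` is a hypothetical helper, not part of Longbow.

```python
# Suffixes copied from the `longbow model -d` log output shown above.
DUMP_SUFFIXES = (
    "dot",
    "simple.dot",
    "spec.json",
    "dense_transition_matrix.pickle",
    "emission_distributions.txt",
)


def dumped_file_names(model_name, version):
    """Return the file names the dump above appears to produce.

    Assumption: files are named longbow_model_<name>.v<version>.<suffix>,
    as seen in the single log shown in this thread.
    """
    prefix = f"longbow_model_{model_name}.v{version}"
    return [f"{prefix}.{suffix}" for suffix in DUMP_SUFFIXES]


if __name__ == "__main__":
    for name in dumped_file_names("mas_15_sc_10x5p_single_none", "2.0.1"):
        print(name)
```

The `.spec.json` entry is the one to edit in the next step.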

Then modify the resulting longbow_model_mas_15_sc_10x5p_single_none.v2.0.1.spec.json file to have the number of elements (or other characteristics) that you want. Changing the number of elements/segments, for example, is as simple as removing and/or adding MAS adapters in the adapter definitions and updating the corresponding array structure lines in the model structure.
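
Before editing the spec by hand, it may help to see what it contains. Here is a minimal Python sketch that loads the dumped JSON, prints a one-line summary of each top-level key, and writes a copy under a new name. It deliberately assumes nothing about the key names inside the file, since the schema is not described in this thread; the helper names and the output file name `my_custom_model.spec.json` are hypothetical.

```python
import json
from pathlib import Path


def load_spec(path):
    """Read a dumped model specification into a plain Python dict."""
    with open(path) as handle:
        return json.load(handle)


def summarize_spec(spec):
    """Map each top-level key to a short description of its value."""
    summary = {}
    for key, value in spec.items():
        if isinstance(value, (list, dict, str)):
            summary[key] = f"{type(value).__name__} (len {len(value)})"
        else:
            summary[key] = type(value).__name__
    return summary


def save_spec(spec, path):
    """Write an edited specification back out as indented JSON."""
    with open(path, "w") as handle:
        json.dump(spec, handle, indent=4)
        handle.write("\n")


if __name__ == "__main__":
    src = Path("longbow_model_mas_15_sc_10x5p_single_none.v2.0.1.spec.json")
    if src.exists():
        spec = load_spec(src)
        for key, description in summarize_spec(spec).items():
            print(f"{key}: {description}")
        # Edit `spec` here once you have identified which keys hold the
        # adapter definitions and the array structure, then save a copy.
        save_spec(spec, "my_custom_model.spec.json")  # hypothetical name
```

Saving to a new file keeps the original dump intact, so you can diff the two to check exactly what changed.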

As for training, we are currently using the same weights for all models (we haven't trained them all individually yet). We have empirically found that these weights work well for all default models (admittedly some models would work better with customized weights).

jamestwebber commented 1 year ago

Is this page (from #195) a sufficient explanation? Any more detail needed?