cerebis / sim3C

Read-pair simulation of 3C-based sequencing methodologies (HiC, Meta3C, DNase-HiC)
GNU General Public License v3.0

Community profile format is cumbersome #30

Open cerebis opened 5 months ago

cerebis commented 5 months ago

The format used to define a community is presently a simple flat table. This approach duplicates a great deal of information, and a cleaner approach would be to use JSON or TOML to define a simple object hierarchy.

The fundamental structure is just a chain of one-to-many relationships:

Community 1->* Cell 1->* Molecule 1->* Segment

Additional details would become parameters at the relevant object level.
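The hierarchy above can be sketched as plain Python dataclasses. This is only an illustration of the one-to-many nesting, not sim3C's actual classes: the class names, field names and the `segment_names` helper are assumptions, and segments are kept as plain strings for brevity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Molecule:
    # A replicon such as a chromosome or plasmid, made of one or more segments.
    name: str
    segments: List[str] = field(default_factory=list)


@dataclass
class Cell:
    # A cell owns any number of molecules.
    name: str
    molecules: List[Molecule] = field(default_factory=list)


@dataclass
class Community:
    # A community owns any number of cells.
    cells: List[Cell] = field(default_factory=list)

    def segment_names(self) -> List[str]:
        # Walk the hierarchy top-down and collect every segment name.
        return [seg
                for cell in self.cells
                for mol in cell.molecules
                for seg in mol.segments]


community = Community(cells=[
    Cell("ecoli", [Molecule("chromosome", ["contig_1", "contig_2"]),
                   Molecule("plasmid", ["contig_3"])]),
    Cell("bsubt", [Molecule("chromosome", ["contig_4"])]),
])
print(community.segment_names())
```

Per-level details such as abundance or copy number would then become extra fields on the class for that level, rather than columns repeated on every row of a flat table.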

An example prototype definition using TOML:

[ecoli]
abundance = 1

[ecoli.chromosome]
copy_number = 1
linear = true
segments = [ "contig_1", "contig_2",]

[ecoli.plasmid]
copy_number = 4
linear = false
segments = [ "contig_3",]

[bsubt]
abundance = 0.5

[bsubt.chromosome]
copy_number = 1
linear = true
segments = [ "contig_4", "contig_5", "contig_6",]

[bsubt.plasmid]
copy_number = 1
linear = false
segments = [ "contig_7",]

cerebis commented 5 months ago

A TOML-based profile definition has now been implemented. With this format, many simulation parameters that were previously defined globally can now be set at the level at which they take effect.

| Level | Parameters |
|---|---|
| Community | spurious_rate |
| Cell | abundance, trans_rate (intermolecular rate) |
| Replicon | copy_number, linear, anti_rate |
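As a rough illustration of what "set at the level at which they take effect" means in practice, the sketch below checks that each parameter in a profile sits at the level listed above. It is not part of sim3C: `LEVEL_PARAMS`, `STRUCTURAL_KEYS`, `misplaced_keys` and `check_profile` are hypothetical names, and the structural keys (`cells`, `replicons`, `name`, `segments`) are assumed from the examples in this thread.

```python
# Parameters permitted at each level, taken from the level/parameter listing.
LEVEL_PARAMS = {
    "community": {"spurious_rate"},
    "cell": {"abundance", "trans_rate"},
    "replicon": {"copy_number", "linear", "anti_rate"},
}

# Structural keys assumed for this sketch (not simulation parameters).
STRUCTURAL_KEYS = {
    "community": {"cells"},
    "cell": {"name", "replicons"},
    "replicon": {"name", "segments"},
}


def misplaced_keys(node, level):
    """Return the keys of `node` that are not expected at `level`."""
    allowed = LEVEL_PARAMS[level] | STRUCTURAL_KEYS[level]
    return sorted(set(node) - allowed)


def check_profile(community):
    """Collect (level, name, key) for every unexpected key in the profile."""
    problems = [("community", "", k)
                for k in misplaced_keys(community, "community")]
    for cell in community.get("cells", []):
        problems += [("cell", cell.get("name", "?"), k)
                     for k in misplaced_keys(cell, "cell")]
        for rep in cell.get("replicons", []):
            problems += [("replicon", rep.get("name", "?"), k)
                         for k in misplaced_keys(rep, "replicon")]
    return problems


profile = {
    "spurious_rate": 0.01,
    "cells": [{
        "name": "ecoli",
        "abundance": 1,
        "trans_rate": 0.1,
        "copy_number": 4,  # deliberately misplaced: belongs on a replicon
        "replicons": [{"name": "chromosome", "copy_number": 1,
                       "linear": False, "anti_rate": 0,
                       "segments": ["contig_1"]}],
    }],
}
print(check_profile(profile))  # [('cell', 'ecoli', 'copy_number')]
```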

In experimenting with this format, I have found that the initial definition is best created programmatically and then serialised to TOML. Modifying the resulting file by hand is then much easier.

In TOML, "community" is a table, each "cell" and "replicon" is a table within an array of tables, while "segments" is a simple array of strings.

Example

In Python, TOML tables are deserialised as dictionaries, while arrays become lists. A user can therefore work in reverse: build the nested dictionaries and lists in Python, then serialise them to TOML. The following is a simple community composed of 2 cells and 3 sequences.

community = {
    'spurious_rate': 0.01, 
    'cells': [

        # First cell in community: one circular chromosome in a single piece
        {'name': 'ecoli', 
        'abundance': 0.6, 
        'trans_rate': 0.1, 
        'replicons': [
            {'name': 'chromosome', 
            'copy_number': 1,
            'linear': False,
            'anti_rate': 0,
            'segments': ['contig_1']}
            ]
        },

        # Second cell in community: one linear chromosome in two pieces
        {'name': 'saur', 
        'abundance': 0.4, 
        'trans_rate': 0.2, 
        'replicons': [
            {'name': 'chromosome', 
            'copy_number': 1,
            'linear': True,
            'anti_rate': 0.3,
            'segments': ['contig_2', 'contig_3']}
            ]
        },

    ]
}
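To make the "go in reverse" step concrete, here is a minimal sketch of writing such a dictionary out as TOML. It is a hand-rolled writer that only handles the three-level layout used in this thread; `toml_value` and `dump_profile` are hypothetical helpers, not sim3C functions, and in practice a TOML library would likely be used instead. String quoting relies on JSON escaping, which is adequate for simple names but is an assumption rather than a complete TOML encoder.

```python
import json


def toml_value(value):
    """Format a scalar or a list of scalars as a TOML value."""
    if isinstance(value, bool):          # check bool before numbers
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        # Double-quoted string with backslash escapes
        return json.dumps(value, ensure_ascii=False)
    if isinstance(value, list):
        return "[" + ", ".join(toml_value(v) for v in value) + "]"
    raise TypeError(f"unsupported value: {value!r}")


def dump_profile(community):
    """Serialise the nested community dictionary to TOML text."""
    lines = ["[community]"]
    lines += [f"{k} = {toml_value(v)}"
              for k, v in community.items() if k != "cells"]
    for cell in community.get("cells", []):
        # Each cell is one element of the array of tables "community.cells"
        lines += ["", "[[community.cells]]"]
        lines += [f"{k} = {toml_value(v)}"
                  for k, v in cell.items() if k != "replicons"]
        for rep in cell.get("replicons", []):
            # Replicons attach to the most recently declared cell
            lines += ["", "[[community.cells.replicons]]"]
            lines += [f"{k} = {toml_value(v)}" for k, v in rep.items()]
    return "\n".join(lines) + "\n"


community = {
    "spurious_rate": 0.01,
    "cells": [
        {"name": "ecoli", "abundance": 0.6, "trans_rate": 0.1,
         "replicons": [
             {"name": "chromosome", "copy_number": 1, "linear": False,
              "anti_rate": 0, "segments": ["contig_1"]},
         ]},
    ],
}
text = dump_profile(community)
print(text)
```

Scalar keys are written before the nested array of tables at each level, which keeps them attached to the right table when the file is read back.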

A larger example of the TOML profile definition

The following involves two cells, each comprising two replicons (a chromosome and a plasmid) whose sequences are split across one or more segments.

[community]
spurious_rate = 0.01

[[community.cells]]
name = "ecoli"
abundance = 1
trans_rate = 0.1

[[community.cells.replicons]]
name = "chromosome"
copy_number = 1
linear = true
anti_rate = 0
segments = [ "contig_1", "contig_2",]

[[community.cells.replicons]]
name = "plasmid"
copy_number = 4
linear = false
anti_rate = 0
segments = [ "contig_3",]

[[community.cells]]
name = "bsubt"
abundance = 0.5

[[community.cells.replicons]]
name = "chromosome"
copy_number = 1
linear = true
anti_rate = 0
segments = [ "contig_4", "contig_5", "contig_6",]

[[community.cells.replicons]]
name = "plasmid"
copy_number = 1
linear = false
anti_rate = 0
segments = [ "contig_7",]