Include statement - Githubissues

CamelCaseCam commented 9 months ago

Have you considered adding an include statement field to divide content from toolkits between multiple organisms? Otherwise, these files could get pretty big.

I'm thinking something simple like includes: ["x", "y", "z"]

Koeng101 commented 9 months ago

Could you give a specific example based off of the current schema?

CamelCaseCam commented 9 months ago

yup - if we look at the pichia file in my fork, it reads

pAOX1:
    ...
pGAP:
    ...

and so on in the promoter section. I would like to have a file called plasmids at collections/pichia and instead put includes: ["pichia/plasmids"] to keep things more organized.

Koeng101 commented 9 months ago

Oh, so lemme get this straight:

In collections.yaml, rather than just having something like file: collections/pichia.yaml, you'd have includes: ["pichia/plasmids/promoter.yaml"], right?

CamelCaseCam commented 9 months ago

Not exactly. You'd still have file: collections/pichia.yaml, but in pichia.yaml I could put includes: ["pichia/plasmids/promoter.yaml"]
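
To make the proposal concrete, here is a minimal sketch of how a loader might resolve an includes field inside a collection file. Everything in it is an assumption for illustration: the function name resolve_includes, the idea that include paths are relative to the collections directory, and the merge rule (included parts are merged into the including file's part mapping) are not taken from the actual parts repository. Files are modelled as an in-memory dict of already-parsed YAML so the example runs without a YAML library.

```python
from typing import Dict, Tuple

# Hypothetical stand-in for files on disk: path -> already-parsed YAML content.
# The part bodies are elided, as in the thread.
FILES: Dict[str, dict] = {
    "collections/pichia.yaml": {
        "includes": ["pichia/plasmids/promoter.yaml"],
    },
    "collections/pichia/plasmids/promoter.yaml": {
        "pAOX1": {"description": "..."},
        "pGAP": {"description": "..."},
    },
}


def resolve_includes(
    path: str,
    files: Dict[str, dict],
    root: str = "collections",
    _stack: Tuple[str, ...] = (),
) -> dict:
    """Return the parts in `path`, with every included file merged in.

    Assumption: include paths are relative to `root`, and parts defined
    directly in a file override same-named parts from its includes.
    """
    if path in _stack:
        raise ValueError(f"circular include: {' -> '.join(_stack + (path,))}")
    content = dict(files[path])  # copy so the input is not mutated
    merged: dict = {}
    for rel in content.pop("includes", []):
        merged.update(resolve_includes(f"{root}/{rel}", files, root, _stack + (path,)))
    merged.update(content)  # local definitions win over included ones
    return merged


parts = resolve_includes("collections/pichia.yaml", FILES)
print(sorted(parts))  # ['pAOX1', 'pGAP']
```

Note that the recursion is exactly the extra layer of depth that a per-collection includes field introduces: the loader has to follow references and guard against cycles instead of reading one flat file.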

Koeng101 commented 9 months ago

Why not just have that in collections.yaml then? Putting includes in the collection file (pichia.yaml) would make the file hierarchy one layer deeper, while breaking what could be seen as the structure of the file (i.e., pichia.yaml is a list of parts).

To take a step back on the overall design, I think the one thing I'm really going for is keeping it as simple / maintainable as possible, while still having enough information for automated systems to work on the sequences. Your pichia link is actually pretty much the perfect amount of information (except references should be a list). It has the following characteristics:

Name that can go into a URL
A description that increases the searchability of the part through literature
A reference to get started with important information about the part
Prefix / suffix for quickly validating that parts can go together
A full sequence for the part itself
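
The five characteristics above can be sketched as a small record type plus two checks. This is only an illustration under stated assumptions: the field names, the URL-safe pattern, and the rule that two parts "go together" when the upstream suffix equals the downstream prefix are guesses, not the repository's actual schema or validation logic. The sequences, prefixes, and suffixes are made-up placeholders, not real part data.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Part:
    """Hypothetical part record mirroring the five characteristics listed above."""
    name: str              # should be usable in a URL
    description: str       # free text, helps literature search
    references: List[str]  # a list, as suggested in the thread
    prefix: str            # compared against the neighbouring part
    suffix: str
    sequence: str          # full sequence of the part


# Assumption: "can go into a URL" means letters, digits, '-' and '_' only.
URL_SAFE = re.compile(r"[A-Za-z0-9_-]+")


def is_url_safe(name: str) -> bool:
    return URL_SAFE.fullmatch(name) is not None


def can_join(upstream: Part, downstream: Part) -> bool:
    # Assumption: parts are compatible when the upstream suffix
    # matches the downstream prefix exactly.
    return upstream.suffix == downstream.prefix


# Placeholder values only; none of these are real sequences.
promoter = Part("pAOX1", "placeholder", ["placeholder"], "AAAA", "CCCC", "ACGTACGT")
cds = Part("example-cds", "placeholder", ["placeholder"], "CCCC", "GGGG", "ACGTACGT")

print(is_url_safe(promoter.name))  # True
print(can_join(promoter, cds))     # True
print(can_join(cds, promoter))     # False
```

A check like can_join only needs the prefix and suffix fields, which is why they allow quick validation without inspecting the full sequence.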

Rather than something like iGEM (the pRham promoter, for example), I'd like to embrace and go for something more along the lines of what FutureHouse's WikiCrow is doing. The actual documentation for each part is contained deep within the scientific literature, which we now have the ability to crawl with LLMs.

On the other side, I'd like to embrace teaching-by-example, so hand-crafting part combinations with descriptions of what they do in order to serve as training data for LLMs and humans to put together parts. But that is separate.

"Otherwise, these files could get pretty big."

They don't get too big, usually <200kb per toolkit. The limiting factor is pretty much guaranteed to be "oh shit I can't reasonably synthesize this".

CamelCaseCam commented 9 months ago

Fair enough - that makes sense!

Koeng101 / parts

Include statement #4