epam / Indigo

Universal cheminformatics toolkit, utilities and database search tools
http://lifescience.opensource.epam.com
Apache License 2.0

Import/export of variant monomers from IDT #1619

Closed: olganaz closed this issue 2 months ago

olganaz commented 9 months ago

Background

Sometimes users need to register oligonucleotides containing randomized or "mixed" bases. This means that a variant monomer may occur at a given position. A variant monomer is a monomer that stands for any one of a listed set of alternative monomers.

Requirements

In addition to the requirements for standard IDT monomers in #1588, the notation used for standard IDT monomers, [s]<Base>[*], can be extended with specific symbols that define variants for modified nucleotides (mixed bases).

  1. Two ways of notation for variant monomers should be supported:
    • Standard mixed bases
    • Custom mixed bases
  2. Standard mixed bases are designated using a capital IUB (International Union of Biochemistry) code (see table).

Examples:

  • GTACTGCAATAGrNrNrNTGATCGAGA
  • CTGCAATAATAGTKCTTRTTNGCN
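
The table referenced above is not reproduced in this text, so here is a minimal sketch of how the standard one-letter codes could be expanded, assuming the usual IUB/IUPAC ambiguity codes. The names `IUB_CODES`, `expand_mixed_base` and `mixed_positions` are hypothetical and are not part of Indigo's API. Prefixes such as `r` in the first example are not handled here.

```python
# Standard IUB/IUPAC ambiguity codes, DNA alphabet (assumed to match the table).
IUB_CODES = {
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}


def expand_mixed_base(symbol: str, rna: bool = False) -> str:
    """Return the bases that a one-letter symbol may stand for."""
    if len(symbol) == 1 and symbol in "ACGTU":
        return symbol  # a plain base stands only for itself
    variants = IUB_CODES[symbol]  # raises KeyError for unknown symbols
    # For RNA, U takes the place of T.
    return variants.replace("T", "U") if rna else variants


def mixed_positions(sequence: str) -> list:
    """List (index, symbol) pairs for every standard mixed base in a plain sequence."""
    return [(i, ch) for i, ch in enumerate(sequence) if ch in IUB_CODES]
```

For the second example, `mixed_positions("CTGCAATAATAGTKCTTRTTNGCN")` would report the K, R and two N positions, and `expand_mixed_base("K")` gives the G/T pair.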
  1. Custom mixed bases are represented in one of two ways, (IUBcode:XXYYZZQQ) or (Ni:XXYYZZQQ), where i = 1..4 and XX, YY, ZZ, QQ are two-digit numbers giving the percentage of each nucleotide: XX for A, YY for C, ZZ for G, and QQ for T in DNA (U in RNA). The four values must sum to 100 (XX+YY+ZZ+QQ=100).
  2. The percentage ratio should be stored as meta-information.
  3. The first instance of a custom mixed base must give both its name and its ratio; all subsequent identical insertions need only the name (see examples).

Examples:

  • ACTGTACCGTATTCC (N:25252525)(N)(N) TTA (N)(N)(N) ATA An N mixed base with 25% of each base is written as (N:25252525). Each subsequent mixed base with this ratio in the sequence is written as (N).
  • CAG +(N:25252525)+(N) TCTACATGTATAAGTA This oligo has two insertions of a mix of 25% of each base (labeled N) with modified sugar.
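
A single custom mixed-base token could be parsed roughly as follows. This is a sketch under stated assumptions, not Indigo's implementation: the regular expression, the function name `parse_custom_base` and the returned dictionary (standing in for the stored meta-information) are all hypothetical. The `+` sugar prefix sits outside the parentheses and is not handled here.

```python
import re

# Assumed token shape: "(NAME)" or "(NAME:XXYYZZQQ)", where NAME is an IUB
# letter or N followed by a digit 1-4.
TOKEN_RE = re.compile(r"\((N[1-4]|[RYMKSWHBVDN])(?::(\d{8}))?\)")


def parse_custom_base(token: str, rna: bool = False):
    """Split a token into (name, ratios); ratios is None for a name-only repeat."""
    match = TOKEN_RE.fullmatch(token)
    if match is None:
        raise ValueError(f"not a custom mixed base: {token!r}")
    name, digits = match.groups()
    if digits is None:
        return name, None  # refers back to an earlier definition
    # Cut the eight digits into four two-digit percentages: A, C, G, T (or U).
    percents = [int(digits[i:i + 2]) for i in range(0, 8, 2)]
    if sum(percents) != 100:
        raise ValueError(f"ratios in {token!r} sum to {sum(percents)}, not 100")
    bases = "ACGU" if rna else "ACGT"
    return name, dict(zip(bases, percents))
```

With this sketch, `parse_custom_base("(N1:20202040)")` returns `("N1", {"A": 20, "C": 20, "G": 20, "T": 40})`, and a token whose digits do not sum to 100 is rejected.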
  1. Up to 4 unique custom mixed-base ratios can be included in an oligo sequence. Each of these ratios must have its own unique name (see examples).

Examples:

  • CAT (N:25252525)(N) T (N1:20202040)(N1)(N) G (N1) A This oligo has three insertions of a mix of 25% of each base (labeled N) and three insertions of a 20% A, 20% C, 20% G, 40% T mix (labeled N1).
  • AGG (K:00005050)(K)(K)(N1:10002070)(N1) AGTA This oligo has three insertions of a 50% G, 50% T mix (labeled K) and two insertions of a 10% A, 20% G, 70% T mix (labeled N1).
  • AGG (N1:00004060)(N1)(N1)+(N2:10002070)+(N2) AGTA This oligo has three insertions of a 40% G, 60% T mix (labeled N1) as standard DNA and two insertions of a 10% A, 20% G, 70% T mix (labeled N2) with modified sugar.
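
The sequence-level rules (define before use, one ratio per name, at most four ratios) could be checked along these lines. This is a hedged sketch: `collect_custom_mixes`, `CUSTOM_RE` and `MAX_CUSTOM_MIXES` are hypothetical names, the token pattern is an assumption, and the check that the digits sum to 100 is left out for brevity.

```python
import re

# Assumed token shape: "(NAME)" or "(NAME:XXYYZZQQ)".
CUSTOM_RE = re.compile(r"\((N[1-4]|[RYMKSWHBVDN])(?::(\d{8}))?\)")
MAX_CUSTOM_MIXES = 4


def collect_custom_mixes(sequence: str) -> dict:
    """Map each custom mixed-base name to its eight-digit ratio string."""
    defined = {}
    for match in CUSTOM_RE.finditer(sequence):
        name, digits = match.groups()
        if digits is None:
            # A name-only token must follow a defining token with the same name.
            if name not in defined:
                raise ValueError(f"({name}) is used before its ratio is defined")
            continue
        # A name may not be reused for a different ratio.
        if defined.get(name, digits) != digits:
            raise ValueError(f"({name}) is redefined with a different ratio")
        defined[name] = digits
    if len(defined) > MAX_CUSTOM_MIXES:
        raise ValueError(f"{len(defined)} custom mixes found, at most {MAX_CUSTOM_MIXES} allowed")
    return defined
```

For the last example above, `collect_custom_mixes("AGG (N1:00004060)(N1)(N1)+(N2:10002070)+(N2) AGTA")` returns `{"N1": "00004060", "N2": "10002070"}`.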

AliaksandrDziarkach commented 2 months ago

As per the IDT documentation, the T option is replaced with U for RNA: [image]

AlexeyGirin commented 1 month ago

Verified.

Versions