Open jkabisch opened 4 years ago
Please include comments of why the sequences are difficult to synthesize.
I added one easy-to-synthesize and one difficult-to-synthesize sequence. Please follow the example and push your sequences into the corresponding folder (https://github.com/Global-Biofoundries-Alliance/DNA-scanner/tree/master/Example_Sequence_Files).
@eoberortner @njhillson @Zulko @neilswainston : please provide additional sequences, both difficult and easy, in all three requested formats. THX
Thanks for the prompt, @jkabisch. But do we really need to do this in three formats? Isn't the sequence itself the issue, and therefore fasta should suffice?
I think one of the specs we gave on the first call was to support the three formats. Plain-text/FASTA would be sufficient for the sequences themselves, but for our integrated workflows, handling GenBank and SBOL in a native (not lossy) way would be really important.
Yep, sure. We need to support the three formats. I was just looking at this task atomically and assumed its focus was on problematic sequences rather than the formats in which they are supplied.
Hi,
the focus is on as many examples as possible that reflect your everyday work. I supplied one simple CDS without any problems and one expression construct (promoter, regulatory sequences and three CDSs) with a lot of problems. The students have already found a parsing library that can handle FASTA, GenBank, and SBOL, so the formats are not a problem.
All the best, Johannes
All, one alternative is to develop a "sequence generator" that can generate sequences according to some constraints. Examples:
-- generate sequence with %GC content < 20%
-- generate sequence with %GC content > 80%
-- generate sequence that contains inverted repeat of length 10bp at position 100 and 200
You can also imagine negating and combining such constraints using logical operators (not, and, or).
Such a sequence generator could also be very helpful for generating sequences that are easy to synthesize.
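To make the idea concrete, here is a rough Python sketch of such a generator. It is not an existing tool: every function name is made up for illustration, positions are assumed to be 0-based, and "inverted repeat at position 100 and 200" is read as "the bases at 200 are the reverse complement of the bases at 100".

```python
import random

def gc_content(seq):
    """Fraction of G/C bases in seq (0.0 for an empty sequence)."""
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0

def reverse_complement(seq):
    """Reverse complement of an uppercase ACGT string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def random_sequence(length, gc_fraction, rng):
    """Random DNA containing exactly round(length * gc_fraction) G/C bases."""
    n_gc = round(length * gc_fraction)
    bases = [rng.choice("GC") for _ in range(n_gc)]
    bases += [rng.choice("AT") for _ in range(length - n_gc)]
    rng.shuffle(bases)
    return "".join(bases)

def with_inverted_repeat(seq, first, second, repeat_length):
    """Overwrite seq[second:second+repeat_length] with the reverse
    complement of seq[first:first+repeat_length] (0-based positions)."""
    if first + repeat_length > second or second + repeat_length > len(seq):
        raise ValueError("repeat arms overlap or run past the sequence end")
    arm = seq[first:first + repeat_length]
    return seq[:second] + reverse_complement(arm) + seq[second + repeat_length:]

# Constraints are plain predicates, so they compose with not/and/or.
def gc_below(limit):
    return lambda seq: gc_content(seq) < limit

def gc_above(limit):
    return lambda seq: gc_content(seq) > limit

def has_inverted_repeat(first, second, repeat_length):
    return lambda seq: (seq[second:second + repeat_length]
                        == reverse_complement(seq[first:first + repeat_length]))

def not_(constraint):
    return lambda seq: not constraint(seq)

def and_(*constraints):
    return lambda seq: all(c(seq) for c in constraints)

def or_(*constraints):
    return lambda seq: any(c(seq) for c in constraints)

# Example: a 300 bp low-GC sequence with a 10 bp inverted repeat at 100/200.
rng = random.Random(0)  # fixed seed so the example is reproducible
low_gc = random_sequence(300, 0.15, rng)
hairpin = with_inverted_repeat(low_gc, 100, 200, 10)
wanted = and_(gc_below(0.20), has_inverted_repeat(100, 200, 10))
print(wanted(hairpin))
```

Building the sequence directly and then checking it with the combined predicate keeps the sketch short; a real generator would probably need rejection sampling or a solver once constraints interact.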
+1 from me, Ernst. A “dodgy sequence generator” would be straightforward for one of the students to develop, and could be used by anyone developing optimisation algorithms.
Probably need a list of problems that should be encoded. I have high global GC, high local GC and repeating nucleotides. Sure there are many more.
Cheers,
Neil.
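As a starting point for that list, the three problems named above could be checked roughly like this. It is only a sketch: the thresholds are placeholders rather than any vendor's actual rules, "repeating nucleotides" is assumed to mean homopolymer runs, and the function names are invented here.

```python
def gc_content(seq):
    """Fraction of G/C bases in seq (0.0 for an empty sequence)."""
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0

def max_window_gc(seq, window=50):
    """Highest GC fraction found in any window of the given size."""
    if len(seq) <= window:
        return gc_content(seq)
    return max(gc_content(seq[i:i + window])
               for i in range(len(seq) - window + 1))

def longest_homopolymer(seq):
    """Length of the longest run of one repeated base."""
    longest = run = 0
    previous = None
    for base in seq:
        run = run + 1 if base == previous else 1
        longest = max(longest, run)
        previous = base
    return longest

def synthesis_problems(seq, max_global_gc=0.65, max_local_gc=0.80,
                       window=50, max_run=8):
    """Return the names of the problems found (placeholder thresholds)."""
    seq = seq.upper()
    problems = []
    if gc_content(seq) > max_global_gc:
        problems.append("high global GC")
    if max_window_gc(seq, window) > max_local_gc:
        problems.append("high local GC")
    if longest_homopolymer(seq) > max_run:
        problems.append("repeating nucleotides")
    return problems

# A balanced sequence with a long G run in the middle.
print(synthesis_problems("AT" * 50 + "G" * 60 + "AT" * 50))
```

Returning the problem names rather than a single pass/fail flag would also give the "why is this difficult" comment requested at the top of the issue.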
Sorry! As a test set I can provide the linear version of ~80 parts from the EMMA standard kit. The kit contains a mix of large and small parts, promoters, etc. There are definitely easy- and difficult-to-synthesize parts in the kit but I haven't got a classification yet. Would that work?
EDIT: +1 to the suggestions above to generate families of sequences with different characteristics.
EDIT2: sent!
@Zulko If they are in one of these formats (FASTA, genbank, SBOL) feel free to send them to us.
Edit: thank you
@gled0n I pushed some more sequences
Supply sets of DNA sequences:
-- three easy to synthesize
-- three difficult to synthesize
-- each in FASTA, GenBank, and SBOL