Global-Biofoundries-Alliance / DNA-scanner

Online tool for comparing prices and feasibility of DNA synthesis
MIT License

Supply DNA sequences to CS students #20

Open jkabisch opened 4 years ago

jkabisch commented 4 years ago

Supply sets of DNA sequences:

Three that are easy to synthesize and three that are difficult, each in FASTA, GenBank, and SBOL.

jkabisch commented 4 years ago

Please include comments on why the sequences are difficult to synthesize.

jkabisch commented 4 years ago

I added one easy-to-synthesize and one difficult-to-synthesize sequence. Please follow the example and push your sequences into the corresponding folder (https://github.com/Global-Biofoundries-Alliance/DNA-scanner/tree/master/Example_Sequence_Files).

jkabisch commented 4 years ago

@eoberortner @njhillson @Zulko @neilswainston: please provide additional sequences that are difficult and easy to synthesize, in all three requested formats. Thanks!

neilswainston commented 4 years ago

Thanks for the prompt, @jkabisch. But do we really need to do this in three formats? Isn't the sequence itself the issue, and therefore fasta should suffice?

njhillson commented 4 years ago

I think one of the specs we gave on the first call was to support the three formats. Plain text/FASTA would be sufficient, but for our integrated workflows, handling GenBank and SBOL in a native (not lossy) way would be really important.


neilswainston commented 4 years ago

Yep, sure. We need to support the three formats. I was just looking at this task atomically and assumed its focus was on problematic sequences rather than the formats in which they are supplied.

jkabisch commented 4 years ago

Hi,

the focus is on as many examples as possible that reflect your everyday work. I supplied one simple CDS without any problems and one expression construct (promoter, regulatory sequences, and three CDSs) with a lot of problems. The students have already found a parsing library that can handle FASTA, GenBank, and SBOL, so the formats are no longer a problem.

All the best, Johannes


-- Jun.-Prof. Dr. Johannes Kabisch Department of Biology Computer-aided Synthetic Biology Technische Universität Darmstadt Schnittspahnstr. 12 Building B2/05 room 107 64287 Darmstadt Germany Tel.: +49 (0)6151 16 22044 email: johannes@kabisch-lab.de web: kabisch-lab.de


eoberortner commented 4 years ago

All, one alternative is to develop a "sequence generator" that can generate sequences according to some constraints. Examples:

- generate a sequence with %GC content < 20%
- generate a sequence with %GC content > 80%
- generate a sequence that contains an inverted repeat of length 10 bp at positions 100 and 200

You can also imagine negating and combining such constraints using logical operators (not, and, or).

Such a sequence generator could be very helpful when generating sequences that are easy to synthesize.
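
To make the proposal concrete, here is a minimal Python sketch of such a constraint-driven generator. Nothing in it comes from the DNA-scanner code base: every function name (`generate`, `gc_below`, `has_inverted_repeat`, and so on) is hypothetical, positions are assumed to be 0-based, and "inverted repeat" is assumed to mean that the second site carries the reverse complement of the first. Constraints are plain predicates, so the logical operators mentioned above become small combinators, and the generator simply rejects candidates until one passes.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def random_sequence(length, gc, rng):
    """Random DNA in which each base is G or C with probability gc."""
    return "".join(
        rng.choice("GC") if rng.random() < gc else rng.choice("AT")
        for _ in range(length)
    )


def with_inverted_repeat(seq, first, second, size):
    """Overwrite seq[second:second+size] with the reverse complement of
    seq[first:first+size]. Assumes the two sites do not overlap."""
    arm = reverse_complement(seq[first:first + size])
    return seq[:second] + arm + seq[second + size:]


# Constraints are predicates: sequence -> bool.
def gc_below(limit):
    return lambda s: gc_content(s) < limit


def gc_above(limit):
    return lambda s: gc_content(s) > limit


def has_inverted_repeat(first, second, size):
    return lambda s: (
        s[second:second + size] == reverse_complement(s[first:first + size])
    )


# Logical operators over constraints.
def negate(constraint):
    return lambda s: not constraint(s)


def all_of(*constraints):
    return lambda s: all(c(s) for c in constraints)


def any_of(*constraints):
    return lambda s: any(c(s) for c in constraints)


def generate(length, gc, constraint, rng, post=None, max_tries=1000):
    """Rejection sampling: draw candidates until one satisfies constraint.
    post is an optional edit applied to each candidate before checking."""
    for _ in range(max_tries):
        seq = random_sequence(length, gc, rng)
        if post is not None:
            seq = post(seq)
        if constraint(seq):
            return seq
    raise ValueError("no sequence satisfied the constraints")


rng = random.Random(0)  # fixed seed so the example is reproducible
low_gc = generate(300, 0.10, gc_below(0.20), rng)
high_gc = generate(300, 0.90, gc_above(0.80), rng)
repeat = generate(
    300, 0.50,
    all_of(has_inverted_repeat(100, 200, 10), negate(gc_above(0.80))),
    rng,
    post=lambda s: with_inverted_repeat(s, 100, 200, 10),
)
```

Rejection sampling is the simplest design that works here, but it only terminates quickly when the target GC passed to `random_sequence` already makes the constraint likely; tightly combined constraints would probably need a constructive approach instead.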

neilswainston commented 4 years ago

+1 from me, Ernst. A “dodgy sequence generator” would be straightforward for one of the students to develop, and could be used by anyone developing optimisation algorithms.

Probably need a list of problems that should be encoded. I have high global GC, high local GC and repeating nucleotides. Sure there are many more.

Cheers,

Neil.
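
As a starting point for that list, the three problems named above (high global GC, high local GC, repeating nucleotides) can each be expressed as a simple check. The Python sketch below is illustrative only: the function name `find_problems` is hypothetical, and every threshold (65% global GC, 80% GC in a 50 bp window, runs longer than 8 identical bases) is an assumed placeholder, not a vendor rule or a value agreed in this thread.

```python
import re


def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.0


def find_problems(seq, global_gc_max=0.65, window=50,
                  local_gc_max=0.80, max_run=8):
    """Return a list of human-readable problems found in seq.

    All thresholds are assumed defaults chosen for illustration.
    """
    seq = seq.upper()
    problems = []

    # High global GC: the whole sequence exceeds the limit.
    if gc_content(seq) > global_gc_max:
        problems.append("high global GC")

    # High local GC: report the first sliding window over the limit.
    for start in range(len(seq) - window + 1):
        if gc_content(seq[start:start + window]) > local_gc_max:
            problems.append(f"high local GC in window starting at {start}")
            break

    # Repeating nucleotides: a run longer than max_run identical bases.
    run = re.search(r"(.)\1{%d,}" % max_run, seq)
    if run:
        problems.append(
            f"homopolymer run of {len(run.group())} x {run.group(1)} "
            f"at {run.start()}"
        )
    return problems


easy = "ATGCATGCAT" * 10
difficult = "AT" * 50 + "GC" * 30 + "A" * 12
easy_problems = find_problems(easy)
difficult_problems = find_problems(difficult)
```

Returning descriptions rather than a single pass/fail flag also fits the earlier request in this thread to record why a sequence is difficult to synthesize.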

Zulko commented 4 years ago

Sorry! As a test set I can provide the linear version of ~80 parts from the EMMA standard kit. The kit contains a mix of large and small parts, promoters, etc. There are definitely easy- and difficult-to-synthesize parts in the kit but I haven't got a classification yet. Would that work?

EDIT: +1 to the suggestions above to generate families of sequences with different characteristics.

EDIT2: sent!

gled0n commented 4 years ago

@Zulko If they are in one of these formats (FASTA, genbank, SBOL) feel free to send them to us.

Edit: thank you

jkabisch commented 4 years ago

@gled0n I pushed some more sequences