bebop / poly

A Go package for engineering organisms.
https://pkg.go.dev/github.com/bebop/poly
MIT License

io->bio #339

Closed Koeng101 closed 8 months ago

Koeng101 commented 12 months ago

This is a fairly large PR to standardize our parsing and writing of files, using generics to implement higher-level functions on top of simplified interfaces that are standardized across all our readers/writers: genbank, gff, fasta, fastq, slow5, sam, uniprot, rebase, and pileup.
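To give a sense of the end state, every format gets used the same way. Here is a rough usage sketch (error handling elided; the exact signatures may shift a bit as the PR evolves):

func exampleUsage() {
    file, _ := os.Open("example.fasta") // hypothetical input file
    defer file.Close()

    parser, _ := bio.NewFastaParser(file)
    header, _ := parser.Header() // format-specific header type
    _ = header
    for {
        record, err := parser.Next() // format-specific record, streamed one at a time
        if err != nil {
            break // io.EOF (or a real error) ends the stream
        }
        _ = record // work with the typed record here
    }
}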

Koeng101 commented 12 months ago

This is a work-in-progress, btw

carreter commented 11 months ago

Why the io to bio rename?

Koeng101 commented 11 months ago

Why the io to bio rename?

So that the name doesn't conflict with the standard library's io package.
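As an illustration of the clash (hypothetical snippet, not code from the repo): any file that wanted both packages would have had to alias one of them, whereas bio coexists with the standard library cleanly.

import (
    "io" // standard library

    polyio "github.com/bebop/poly/io" // old layout: an alias is required to coexist with "io"
)

// After the rename, no alias is needed:
//
// import (
//     "io"
//
//     "github.com/bebop/poly/bio"
// )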

Koeng101 commented 11 months ago

fasta parser test coverage is now at 98.5%; the only gap is the Scanner non-EOF error, which I couldn't figure out how to test after briefly trying. I removed a ton of code, and everything should be fairly simple now.
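For the record, the usual trick for exercising a non-EOF path is a reader that errors mid-stream. A minimal sketch, assuming the parser surfaces the underlying reader's error from Next() (failReader and the test below are hypothetical, not part of this PR):

// failReader hands back one valid chunk, then a non-EOF error.
type failReader struct{ calls int }

func (r *failReader) Read(p []byte) (int, error) {
    if r.calls == 0 {
        r.calls++
        return copy(p, ">seq1\nATGC\n"), nil
    }
    return 0, errors.New("simulated read failure")
}

func TestScannerNonEOFError(t *testing.T) {
    parser, _ := bio.NewFastaParser(&failReader{})
    var err error
    for err == nil {
        _, err = parser.Next()
    }
    if errors.Is(err, io.EOF) {
        t.Error("expected a non-EOF error, got io.EOF")
    }
}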

Koeng101 commented 11 months ago

fixes #325

Koeng101 commented 11 months ago

fixes #327

Koeng101 commented 11 months ago

I'm gonna say this is probably no longer a work-in-progress. @TimothyStiles review?

Koeng101 commented 11 months ago

Can we move all the bio/<format>/data files to test/data or testdata/?

The difficulty there would be that if any of the specific parsers require the testdata, they can't really import ... So we'd have to have 2 copies, or we'd just have different test data for the specific parsers and the generic parsers.

Koeng101 commented 11 months ago

Can we move all the bio/<format>/data files to test/data or testdata/?

The difficulty there would be that if any of the specific parsers require the testdata, they can't really import ... So we'd have to have 2 copies, or we'd just have different test data for the specific parsers and the generic parsers.

I'm completely wrong here actually, I was thinking of go:embed. Yes we can do this.

===

Ironically enough though, all the test data used to live in the top level directory github.com/TimothyStiles/poly/data, but it got a little confusing with all the different file formats, so there was a move towards having the test files locally.
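For anyone else who mixes this up like I did: //go:embed patterns can't reach outside the package directory (no ".." allowed), while tests that read files at runtime can point anywhere in the repo. A minimal sketch (the paths are hypothetical):

import (
    _ "embed"
    "os"
    "testing"
)

// go:embed only works for files at or below this package's directory:
//
//go:embed data/example.fasta
var embeddedExample []byte

// Plain file reads in tests can use a shared top-level directory instead:
func TestSharedTestdata(t *testing.T) {
    contents, err := os.ReadFile("../../test/data/example.fasta") // hypothetical shared path
    if err != nil {
        t.Fatal(err)
    }
    _ = contents
}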

Koeng101 commented 11 months ago
Auto-merging bio/genbank/example_test.go
Auto-merging bio/genbank/genbank_test.go
Auto-merging bio/gff/example_test.go
Auto-merging bio/pileup/pileup.go
CONFLICT (content): Merge conflict in bio/pileup/pileup.go
Auto-merging bio/pileup/pileup_test.go
CONFLICT (content): Merge conflict in bio/pileup/pileup_test.go
Auto-merging bio/polyjson/example_test.go
Auto-merging bio/slow5/slow5.go
Auto-merging bio/uniprot/example_test.go
CONFLICT (content): Merge conflict in bio/uniprot/example_test.go
CONFLICT (modify/delete): io/example_test.go deleted in HEAD and modified in main.  Version main of io/example_test.go left in tree.
CONFLICT (modify/delete): io/fasta/fasta.go deleted in HEAD and modified in main.  Version main of io/fasta/fasta.go left in tree.
Auto-merging seqhash/seqhash_test.go
Auto-merging synthesis/codon/codon_test.go
Auto-merging synthesis/codon/example_test.go
Automatic merge failed; fix conflicts and then commit the result.

Good lord, going through and fixing.

Koeng101 commented 11 months ago

@TimothyStiles Just made the genbank parser compatible. All it cost was... nuking all of the tests, because they pretty much all relied on read functions. Those I'll have to add back in later.

Koeng101 commented 11 months ago

Ah crap, other files depend on genbank.Read... I'll fix later.

carreter commented 11 months ago

Crap, seems like the line numbers on my last review got messed up because I hadn't synced VSCode with the remote repo. Let me know if you have any questions @Koeng101

carreter commented 11 months ago

Don't forget to credit @folago :

Co-authored-by: name <name@example.com>

carreter commented 11 months ago

Fixes #364.

Koeng101 commented 9 months ago

Ok, I think it's more important to get this merged than to make it perfect. I'm doing the following.

I will be merging genbank tests and Next comments. Upcoming...

Koeng101 commented 9 months ago

If we want to go ahead with this, we would spawn 3 new issues:

Koeng101 commented 9 months ago

TODO:

Koeng101 commented 9 months ago

TODO:

* Internalize parse parameters

* Add back fasta format in docs

* Remove that extra space

Parse parameters are actually there for a reason. They get reset fairly often, so it is easier to have them outside the parser itself. We could have a separate method for fully resetting them, but I think it is fairly readable as is.

Koeng101 commented 9 months ago

Alright, looking this over, it looks ready for a review from @TimothyStiles.

TimothyStiles commented 9 months ago

@Koeng101

I like the idea of utilizing generics but am not too familiar with their usage and syntax. Can you add a little "how-to" guide for writing parsers and writers in this new setup?

Koeng101 commented 8 months ago

@Koeng101

I like the idea of utilizing generics but am not too familiar with their usage and syntax. Can you add a little "how-to" guide for writing parsers and writers in this new setup?

Hmmm, for both of those it doesn't really have much to do with generics (it's mostly just interfaces) - you only need generics once you write the top-level convenience function, which is essentially copy-pasted for each parser type because Go generics are limited. Really, what you have to be concerned about when writing parsers is the interfaces they must implement. To be clearer about the generics:

// NewFastaParser initiates a new FASTA parser from an io.Reader.
func NewFastaParser(r io.Reader) (*Parser[*fasta.Record, *fasta.Header], error) {
    return NewFastaParserWithMaxLineLength(r, DefaultMaxLengths[Fasta])
}

// NewFastaParserWithMaxLineLength initiates a new FASTA parser from an
// io.Reader and a user-given maxLineLength.
func NewFastaParserWithMaxLineLength(r io.Reader, maxLineLength int) (*Parser[*fasta.Record, *fasta.Header], error) {
    return &Parser[*fasta.Record, *fasta.Header]{parserInterface: fasta.NewParser(r, maxLineLength)}, nil
}

// NewFastqParser initiates a new FASTQ parser from an io.Reader.
func NewFastqParser(r io.Reader) (*Parser[*fastq.Read, *fastq.Header], error) {
    return NewFastqParserWithMaxLineLength(r, DefaultMaxLengths[Fastq])
}

// NewFastqParserWithMaxLineLength initiates a new FASTQ parser from an
// io.Reader and a user-given maxLineLength.
func NewFastqParserWithMaxLineLength(r io.Reader, maxLineLength int) (*Parser[*fastq.Read, *fastq.Header], error) {
    return &Parser[*fastq.Read, *fastq.Header]{parserInterface: fastq.NewParser(r, maxLineLength)}, nil
}

What you can notice from both of these is that the Fastq version is literally the same as the Fasta version, except the Data and Header now point to fastq types rather than fasta types. The only reason you have to copy and paste this for every format, instead of passing the format as a variable into a single function, is the limitations of Go's type system.
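To make the copy-paste point concrete: the most you can factor out is a tiny constructor helper like the hypothetical one below (using the parserInterface and Parser types defined in the next section), and even then each format still has to spell out its concrete types, because Go can't pick type parameters at runtime.

// newParser is a hypothetical helper, not in this PR: it removes the
// struct-literal boilerplate, but the caller still names the concrete types.
func newParser[Data io.WriterTo, Header io.WriterTo](inner parserInterface[Data, Header]) *Parser[Data, Header] {
    return &Parser[Data, Header]{parserInterface: inner}
}

// Each format-specific constructor still repeats its own pair of types:
func newFastaParserSketch(r io.Reader, maxLineLength int) *Parser[*fasta.Record, *fasta.Header] {
    return newParser[*fasta.Record, *fasta.Header](fasta.NewParser(r, maxLineLength))
}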

writers + parsers

"How to" make a parser is almost entirely contained within these few lines (as part of the type system):

type parserInterface[Data io.WriterTo, Header io.WriterTo] interface {
    Header() (Header, error)
    Next() (Data, error)
}

type Parser[Data io.WriterTo, Header io.WriterTo] struct {
    parserInterface parserInterface[Data, Header]
}

Let's dive into the Go type system to understand it better. Parser is a generic struct that wraps anything satisfying parserInterface. What does that mean? Well, it means that any struct implementing the methods Next() and Header() satisfies parserInterface. Both Next() and Header() just return an io.WriterTo and an error - a standard library interface plus the built-in error type.

To sum it up, a Parser wraps any struct that returns an io.WriterTo and an error when you call Header() or Next(). It is extremely simple.

For example, the following parser would satisfy the bio.Parser:

type SimpleHeader struct{}
type SimpleData struct{}

func (h SimpleHeader) WriteTo(w io.Writer) (n int64, err error) { return 0, nil }
func (d SimpleData) WriteTo(w io.Writer) (n int64, err error) { return 0, nil }

type SimpleParser struct{}

func (p SimpleParser) Header() (SimpleHeader, error) { return SimpleHeader{}, nil }
func (p SimpleParser) Next() (SimpleData, error) { return SimpleData{}, nil }

In this example, we have all the fundamentals needed:

  1. A Header
  2. A Data
  3. A WriteTo() method implemented on the Header (so it satisfies io.WriterTo)
  4. A WriteTo() method implemented on the Data (so it satisfies io.WriterTo)
  5. A Parser
  6. A Header() implemented on the parser
  7. A Next() implemented on the parser

The clever bit is realizing that pretty much every single biological data structure has something along these lines, and that Next() supports stream-reading - and, as a side effect of implementing the io.WriterTo interface, you also get writing built into the type definitions in a standard-library-supported way.

In this way, the interfaces here are essentially just forcing parsers to expose 4 functions (WriteTo on each of the two data structures, plus Next and Header), and by doing that, you now have a completely standardized way of working with each parser.
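As a concrete payoff, copying any supported format from one stream to another is just Header/Next plus WriteTo. A minimal sketch, assuming the generic Parser delegates Header() and Next() to the wrapped parser and that Next() reports io.EOF at the end of the stream:

func copyFasta(in io.Reader, out io.Writer) error {
    parser, err := bio.NewFastaParser(in)
    if err != nil {
        return err
    }
    header, err := parser.Header()
    if err != nil {
        return err
    }
    if _, err := header.WriteTo(out); err != nil { // the header knows how to write itself
        return err
    }
    for {
        record, err := parser.Next()
        if errors.Is(err, io.EOF) {
            return nil // end of stream
        }
        if err != nil {
            return err
        }
        if _, err := record.WriteTo(out); err != nil { // and so does every record
            return err
        }
    }
}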

wrapping it up

Let's say you want to implement SimpleParser as a top level parser. Here is how you would do that:

// NewSimpleParser wraps SimpleParser in the generic Parser, using SimpleParser's
// value return types (SimpleData, SimpleHeader) as the type parameters.
func NewSimpleParser() (*Parser[SimpleData, SimpleHeader], error) {
    return &Parser[SimpleData, SimpleHeader]{parserInterface: SimpleParser{}}, nil
}

All this does is:

  1. Say that we are returning a Parser with the two types SimpleData and SimpleHeader
  2. Satisfy the parserInterface needed for a Parser (because SimpleParser has those handy Next() and Header() methods)
  3. Return the instantiated generic Parser, wrapping that underlying SimpleParser (the delegating methods this relies on are sketched right after this list)
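The one piece those call sites rely on but that isn't shown in the snippets is the pair of delegating methods on the generic Parser itself - roughly like this (the actual implementation may differ slightly):

// Header returns the format-specific header by delegating to the wrapped parser.
func (p *Parser[Data, Header]) Header() (Header, error) {
    return p.parserInterface.Header()
}

// Next returns the next format-specific record by delegating to the wrapped
// parser (with io.EOF, or another error, ending the stream).
func (p *Parser[Data, Header]) Next() (Data, error) {
    return p.parserInterface.Next()
}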

why we couldn't do this before generics

The hard bit is that I wanted to expose the underlying data types when you do things with a Parser. To pull from some above examples:

parser, _ := bio.NewFastaParser(file)
fastaRecord, _ := parser.Next()
// you should be able to do the below!
fmt.Println(fastaRecord.Sequence)
fmt.Println(fastaRecord.Read)

Without a generic interface, we couldn't say that it was fine for Data to contain other things, like Sequence or Name, beyond the required WriteTo. Essentially, all the generics are doing is allowing Data to carry those extra fields without having to go through complicated type-conversion magic.
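For contrast, here is roughly what the same call site looks like without generics, where Next() can only promise an io.WriterTo and every caller needs a type assertion (hypothetical pre-generics sketch):

// Hypothetical non-generic parser: Next() returns the lowest common denominator.
record, _ := parser.Next()                // record is just an io.WriterTo
fastaRecord, ok := record.(*fasta.Record) // type assertion required to reach the fields
if !ok {
    // wrong record type - handle the mismatch
}
fmt.Println(fastaRecord.Sequence)

// With the generic Parser, Next() already returns *fasta.Record, so none of
// this ceremony is needed.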

paleale commented 8 months ago

Hello! Can I ask you to look at and apply my changes for #352? I don't know the proper way to suggest such changes, so I've prepared a patch: 0001-remove-not-working-error-handling-from-GetSequence-f.patch The changes to CONTRIBUTION.md were also discussed in #352... Thank you!

Koeng101 commented 8 months ago

Hello! Can I ask you to look at and apply my changes for #352? I don't know the proper way to suggest such changes, so I've prepared a patch: 0001-remove-not-working-error-handling-from-GetSequence-f.patch The changes to CONTRIBUTION.md were also discussed in #352... Thank you!

Sure! Could you add a pull request with your changes into this branch? That would be the proper way + our tooling for reviewing changes would work well with that. Normally, you just make the changes on your own fork, and then do a pull request into this branch.

isaacguerreir commented 8 months ago

Now that we're using generics, it would be interesting to have a factory that decides which parser to use based on the file content, so the developer doesn't need to choose a parser explicitly.

Koeng101 commented 8 months ago

Now that we're using generics, it would be interesting to have a factory that decides which parser to use based on the file content, so the developer doesn't need to choose a parser explicitly.

Due to how generics work in Go, I'm not sure you can actually do this. Basically, if the output signature is Parser[Data, Header], the concrete types have to be fixed at compile time - a single function can't return a Parser[*fasta.Record, *fasta.Header] in one case and a differently-typed Parser in another.
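The closest workaround is branching at the call site, since the concrete type parameters have to be written out at compile time. A minimal sketch (the format-detection step is hand-waved and hypothetical):

func parseDetected(format string, r io.Reader) {
    switch format { // format would come from sniffing the file extension or contents
    case "fasta":
        parser, _ := bio.NewFastaParser(r)
        record, _ := parser.Next() // *fasta.Record, fully typed in this branch
        fmt.Println(record.Sequence)
    case "fastq":
        parser, _ := bio.NewFastqParser(r)
        read, _ := parser.Next() // *fastq.Read, fully typed in this branch
        _ = read // work with the fastq-specific fields here
    }
}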

isaacguerreir commented 8 months ago

My mistake - for some reason, reading your explanation, I thought it would be possible to return generic types as outputs, but taking a better look at the documentation it seems Go generics can't be used like that. It would be possible if the function returned a common structure, which is not the case with the current implementation.

Koeng101 commented 8 months ago

seems Go generics can't be used like that. It would be possible if the function returned a common structure, which

Yes, we don't have a common structure shared between all of the different parsers, and I think that's a good thing. It makes each data type very specific and concise.