Add fasta2 package - Githubissues

folago commented 1 year ago

This is exploratory work and does not have to be merged in this state.

There are for sure some functionalities missing as well as docs and examples, but this more or less the direction I would go about a fasta package.

I called it fasta2 just to have it side by side and maybe try both API, but in case some of this package is liked we can decide how much of it to keep and integrate or discard.

Here is a list of things in no particular order I do not like about the status quo that I'm trying to fix in the fasta2 package:

fasta.Fasta stutters so I would rename it to fasta.Record also the Name field should be called Identifier as in the fastq package or Header as mentioned in the BioStar handbook.

I dot understand the gzipReaderFn, openFn, and buildFn functions.
Build() does not sound vary idiomatic for serialization in bytes I used Bytes() as a method of Record.
Test data should be in the testdata directory.
ParseConcurrent() is IMHO the nicest parser implementation, so I extracted the state in a struct and added some methods to get it to be the only parsing code to maintain.
As for the rest of the concurrent functions it is frowned upon to have channels in the API, better to provide a clean API that is easy to use with channels.

Some other thoughts:

Does it make sense to have a Parser interface in the io package and have implementation of it in each sub package?
Is it necessary to have the gzip functions? I understand that it can be convenient but then should we add some other compression algorithms?

Please let me know what you think.

Koeng101 commented 1 year ago

Generally, I really like the idea of changing the name of fasta.Fasta to fasta.Record. I think we should implement this in fasta.

I don't really understand those "gzipReaderFn, openFn, and buildFn functions" either - if there are two of us, probably others wouldn't either, and this would decrease readability. I think we should remove them.

Does it make sense to have a Parser interface in the io package and have implementation of it in each sub package?

This is something I'd really like to do! Though I think it would probably be a generic of some kind? Basically, I think for every io (genbank, fasta, fastq, slow5, sam, pileup) we should implement the following standard functions:

type Parser[T genbank | fasta | fastq | slow5 | sam | pileup, TH any] struct { // TH should have their respective headers
   Header() (TH, error)
   Next() (T, error)
   MaxLineCount() int64
   Reset() error
}

or maybe something like that - I am not sure what is most readable. Of course, the records / alignments / reads themselves would implement the Write interface, which honestly I think we should just standardize to get into a io.Writer. Generics should at least allow us to build on top of that io.Writer in a way that we aren't rebuilding the same function six times.

Thoughts? happy to help prototype this this weekend.

Is it necessary to have the gzip functions? I understand that it can be convenient but then should we add some other compression algorithms?

GZIP is so widely used by fastq at least, a reasonable default would be to read gz files, not raw text, if it wasn't for the fact that it wouldn't make as much sense to programmers (but make perfect sense to end users - you pretty much only push around sequencing as fastq).

Still, it think this should probably be implemented on top of or somewhere else. It's kinda ridiculous that we have to implement functions like reading gz 6 times. If we had this in one place, it'd be a lot easier to build functions that could compress in may different ways.

For the io.Writer stuff, I have a small PR doing that here - https://github.com/TimothyStiles/poly/pull/337/files . There is one important bug fix in there too - https://github.com/TimothyStiles/poly/blob/5cde1b2232331be07625786b6201c7c0dce1717e/io/fastq/fastq.go#L179-L185 - which I have no idea why it helped but it did, but only for my very large files. I think it has something to do with memory, so it was hard to reproduce in an individual test.

carreter commented 12 months ago

@Koeng101 Is there anything in this PR that wasn't included in #339 ?

Koeng101 commented 12 months ago

Everything here is implemented in #339 I think, but the start of #339 was because of this PR. I'd like to give some credit somehow, even if there isn't significant code overlap

TimothyStiles commented 12 months ago

We should be able to do that @Koeng101. We just need to add @folago's GitHub email in a certain way to the bottom of the PR when we merge it.

Koeng101 commented 12 months ago

Aight, we will include that then. Hope that works @folago ! I appreciate your contributions!

folago commented 11 months ago

Hey, sure all good on my side. Thanks a lot 🙂

bebop / poly

Add fasta2 package #337