dotnetbio / bio

Bioinformatics library for .NET
Apache License 2.0

BAMParser Extensibility Enhancements #8

Closed KRBeut closed 3 years ago

KRBeut commented 8 years ago

Created a generic version of the BAMParser class that allows developers using the .NET Bio library to extend it more easily. To maintain compatibility with existing code, there is also a non-generic BAMParser class that inherits from the generic version with the type parameter set to SAMAlignedSequence.

If this change is accepted, I would also like to suggest that we allow the BAMParser class to accept a factory method as a constructor parameter. Such a feature would further improve the extensibility of the class by inverting control of object creation.
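To make the proposal concrete, here is a minimal sketch of the shape being described. This is illustrative only, not the actual pull-request code; the constructor parameter, field names, and the `GetNewAlignedSequence` method are assumptions, and `SAMAlignedSequence` is stubbed in place of the real .NET Bio type.

```csharp
using System;

public class SAMAlignedSequence { /* stand-in for the existing .NET Bio type */ }

// Generic parser: T is the concrete aligned-sequence type to produce.
public class BAMParser<T> where T : SAMAlignedSequence, new()
{
    private readonly Func<T> _factory;

    // The suggested factory constructor parameter: callers can invert
    // control of object creation (e.g. to pool or pre-configure instances).
    public BAMParser(Func<T> factory = null)
    {
        _factory = factory ?? (() => new T());
    }

    protected virtual T GetNewAlignedSequence()
    {
        return _factory();
    }
}

// Non-generic class kept for compatibility with existing code.
public class BAMParser : BAMParser<SAMAlignedSequence>
{
}
```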

evolvedmicrobe commented 8 years ago

Hi KRBeut,

Thanks for submitting a pull request! We'd love to take a look and incorporate any changes. Just a few quick comments from my end.

Have you seen the BAMParserExtensions in Shared\Extensions\BAMParserExtensions? These contain a range of utilities for parsing BAM files, and I am not sure they would play well with these changes. For example, the ParseRange method parses only the reads between a start and end position on a given chromosome, but I believe a generic version of BAMParser would not benefit from these extensions. The problem I see is that the pull request makes the Parse method virtual but does not alter or override any of the extension methods that let users query a subset of a BAM file. I think whatever changes we implement should allow all generic classes to use the extension methods (and we also need to implement these so that they return IEnumerables rather than SequenceAlignmentMaps).
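To sketch that concern in code: the existing extensions target the non-generic parser, so generic parsers would presumably need counterparts along the following lines. The signatures here are illustrative guesses, not the real BAMParserExtensions API, and the referenced types are assumed from the discussion.

```csharp
using System;
using System.Collections.Generic;

public static class BAMParserExtensionsSketch
{
    // Existing style: an extension tied to the non-generic parser that
    // materializes a whole SequenceAlignmentMap.
    public static SequenceAlignmentMap ParseRange(
        this BAMParser parser, string fileName,
        string refSeqName, int start, int end)
    {
        throw new NotImplementedException(); // placeholder body
    }

    // A generic counterpart would work for any BAMParser<T> and could
    // stream results lazily instead of building a full map.
    public static IEnumerable<T> ParseRange<T>(
        this BAMParser<T> parser, string fileName,
        string refSeqName, int start, int end)
        where T : SAMAlignedSequence, new()
    {
        yield break; // placeholder body
    }
}
```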

The second issue is that it is not clear to me how making this generic helps the end user create special derived classes. It seems like the greatest asset would be to let users decode the BAM entries in a special way, so that they can store certain information in defined fields instead of in dictionaries. However, this would require overriding the GetAlignedSequence method, which is still kept private here. Can I ask how this helped you create a special class derived from SAMAlignedSequence? (Hoping to understand the rationale a bit more.)

Cheers, N

KRBeut commented 8 years ago

I have not looked at the BAMParserExtensions yet. I will look at them tomorrow and either comment more specifically or simply submit additional changes to the pull request, as you may already be suggesting.

I think the generic class will help users (at the very least, me!) create special derived classes. Aside from the pool pattern mentioned above, I originally wanted to show how these changes improve extensibility by re-implementing PacBioCCSBamReader as a class that inherits from the generic version of BAMParser. However, it appears that the PacBioCCSRead class does not inherit from SAMAlignedRead (even though it is one and probably should), and the PacBioCCSBamReader class does not inherit from BAMParser (even though it is one and probably should).

I would have thought that PacBioCCSBamReader would inherit from BAMParser. However, achieving the same API and functionality with the non-generic BAMParser class may not have been possible or easy. With the generic BAMParser class this would be very simple, if PacBioCCSRead inherited from SAMAlignedRead as it probably should.

evolvedmicrobe commented 8 years ago

On memory management, I would say be very cautious about trying to outsmart the garbage collector unless you have strong reasons and profiling data to show you can do better. I have learned this the hard way a few times.

In particular, regarding whether or not it is a good idea to repeatedly create a new SAMAlignedSequence or re-use the same memory, I would keep the following in mind:

1 - GCs are great for short-lived objects - If an object is quickly created and consumed, it is a short-lived object for which garbage collectors have kick-butt performance. Although it depends on the heap manager, memory allocations in C through malloc can be much more expensive than allocations with a garbage collector: a malloc call can involve a context switch, interaction with a heap manager, and all kinds of other stuff. In C#, a memory allocation is typically just a fast pointer increment, and the code to clear the memory is likely as fast as, or faster than, whatever your C code would be doing.

2 - Memory locality - If you keep a single sequence object alive for parsing, it will eventually be promoted by the garbage collector to an older generation. This means it will sit further away in memory from all the new objects you are creating in generation 0, and references to those new generation-0 objects will be written into the older generation, which the GC must track and which can slow down collection. If everything is in generation 0 and there are no live references to any of those objects, reclaiming that memory is super fast.
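The two strategies under discussion can be sketched as follows. This is a compilable illustration only; the types and helpers are stubs standing in for real BAM decoding, and no performance numbers are implied.

```csharp
using System.Collections.Generic;

static class AllocationSketch
{
    // Stubs so the sketch compiles; real code would decode BAM blocks.
    sealed class AlignedSeq { public string QName; }
    static void Decode(byte[] block, AlignedSeq seq) { seq.QName = "read"; }
    static void Consume(AlignedSeq seq) { }

    static void Main()
    {
        var blocks = new List<byte[]> { new byte[8], new byte[8] };

        // Strategy A: a fresh object per read. These are short-lived
        // generation-0 objects that a generational GC reclaims cheaply.
        foreach (var block in blocks)
        {
            var seq = new AlignedSeq();   // fast bump-pointer allocation
            Decode(block, seq);
            Consume(seq);                 // no live reference survives the loop
        }

        // Strategy B: one reused object. It gets promoted to an older
        // generation, and each write of a new generation-0 reference into
        // it must be tracked by the GC, offsetting the saved allocations.
        var reused = new AlignedSeq();
        foreach (var block in blocks)
        {
            Decode(block, reused);
            Consume(reused);
        }
    }
}
```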

Anyway, in general (and particularly on Linux) avoiding memory allocations is good, but I'd encourage you to run some profiling to verify how much extra performance you are getting for the hassle. If you are using Visual Studio, the profiling tools there are awesome; I'd be glad to look at the results if you felt like sharing.

Regarding PacBioCCSRead - I actually wanted this to implement ISequence and not inherit from SAMAlignedSequence. Although both are stored in BAM files, CCS reads in BAM files don't have any alignment information associated with them, so it didn't make sense to think of them as "aligned sequences." I also didn't want a bloated SAMAlignedSequence object, since I would be holding a lot of them in memory. The PacBioCCSRead class has very few reference types, so it takes up less memory and requires less time per object for the GC to trace its references than if it were derived from SAMAlignedSequence. Granted, the parser is inefficient because it wraps the BAM parser, but the real problem is that all the tag data goes from binary to string and then back to binary (and some of the decoded information is simply discarded, as many fields are always */0). Properly replacing the decoding of the binary information would have required changing the decoding function within BAMParser, which, given that the current approach was fast enough and easy to write, didn't seem worth the effort. I still think that could be a big performance win for known file types, but that isn't what's being discussed here.

Anyway, all of which is to say I am not opposed to the changes. We're a community project, so whatever helps someone and doesn't hurt anyone else is good to go in my book! I just wanted to suggest you look at the extension methods to see if you can incorporate them into the generics; as you say, these files are gigantic, and I think it would be nice to preserve the random-access query abilities across all generic types of BAMParser.

Cheers, Nigel

KRBeut commented 8 years ago

Hi evolvedmicrobe,

The memory management guidance from here:

https://msdn.microsoft.com/en-us/library/ms973837.aspx

is the initial reason I wanted to use pools and similar patterns. Specifically, the first piece of advice in the "Too Many Allocations" section.

Here is the to do list for this pull request based on our discussion so far:

1.) Work with the BAMParserExtensions to see how they integrate with the changes in the current pull request.

2.) Add performance profiling for different implementations of GetNewAlignedSequence.

3.) Implement an example of extending the BAMParser class by creating a

  public class IonTorrentBAMParser<T> : BAMParser<T> where T : IonTorrentRead, new()

This class will return strongly typed IonTorrentRead objects that have the flow-space values parsed out of the tags and ready for strongly typed consumption.
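Usage of such a parser might then look something like the sketch below. All names here are hypothetical (IonTorrentRead, FlowSpaceValues, and the Parse signature are illustrative assumptions, not existing .NET Bio API).

```csharp
// Hypothetical usage sketch, not real .NET Bio API.
var parser = new IonTorrentBAMParser<IonTorrentRead>();
foreach (IonTorrentRead read in parser.Parse("sample.bam"))
{
    // Flow-space values arrive in a strongly typed field instead of
    // being fished out of a generic tag dictionary.
    float[] flows = read.FlowSpaceValues;
}
```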

Does this cover the list of feedback so far? Does anyone else have input they'd like to share?

Can you advise me on how to proceed with this pull request? Basically, I need to know how to add the items listed above to it. Is there a way to add more changes directly to the pull request, or is there some other mechanism I should use? Another option would be to accept the current pull request and then submit a second one with these changes. Please advise.

-KRBeut

evolvedmicrobe commented 8 years ago

Hi KRBeut,

Apologies for the slow response, we're (like everyone) very busy, but I am really excited about your work to extend the BAM parser to handle new data types.

In any event, all three of those would be great, though if the performance profiling is not easy, please do not worry about it at this stage. Getting Git, profiling, and meaningfully useful code all right on the first pull request would be difficult, and the third item matters far more than the first two.

Adapting the BAM parser extensions would be good, and having the actual implementation of your specific and novel parser would be really fantastic! A change that enables future useful work is fine, but a change that not only enables but also comes with useful additions (such as a parser for a new format!) is far superior. Perhaps also add a comment to the source indicating why BAMParser allows you to override the method and call the generic type?

Regarding git... We have never been version-control fascists, but generally people aim for a "clean" history. This means optimizing for fewer total commits and fewer merge events, with good documentation associated with each git event. To learn more about git generally, I would recommend reading a quick website or book. For example, with an email address you can get the nice "Git Succinctly" book, which I can verify can be read while riding the bus to work (https://www.syncfusion.com/resources/techportal/ebooks/git). This should cover most typical conventions. The central things to know for pull requests are how to branch and rebase. Borrowing some instructions from a typical workflow: you should likely branch (rather than use master) for a new feature and then squash repeated commits, so that your pull request is one commit on one new branch rebased onto the current HEAD of master. The commands to do this would look a bit like the following:

Add your GitHub fork as a remote:    

  git remote add my-remote git@github.com:pb-myusername/foo.git    

You can work on a feature branch:    

  git checkout -b my-feature origin/master
  git fetch origin
  git rebase origin/master
  ... (Edit files.)
  git fetch origin
  git rebase origin/master
  git push my-remote my-feature

However, as I said, we can clean up the git history before merging; for now, just worry about producing useful code, then commit on top of your branch and re-push to GitHub. You should not need to open a new pull request after committing further changes and pushing to the server.

Sorry again for the delay.

Cheers, Nigel

markjulmar commented 8 years ago

@evolvedmicrobe

I think it's a bit confusing. Normally that relationship is reversed; i.e., you have a base class with the plain name and the generic type derives from it. See types like IEnumerable and IEnumerable<T> as an example.

I'd suggest we rename the generic base type here to something other than BAMParser. I'm not a huge fan of Base suffixes either, but it seems like one might be OK here... thoughts?

evolvedmicrobe commented 8 years ago

@KRBeut Not sure where we left off on this; are you still interested in merging it?

markjulmar commented 3 years ago

Closing due to inactivity.