(1d) GFF download *REALLY* slow from ref seq / 9 annotations

GMOD / Apollo

Genome annotation editor with a Java Server backend and a Javascript client that runs in a web browser as a JBrowse plugin.

http://genomearchitect.readthedocs.io/


(1d) GFF download REALLY slow from ref seq / 9 annotations #274

Closed nathandunn closed 8 years ago

nathandunn commented 9 years ago

Fixed this, but then I got another error, not being able to do right-clicks . . so I didn't check this fix in.

Cannot invoke method replaceAll() on null object. Stacktrace follows:

java.lang.NullPointerException: Cannot invoke method replaceAll() on null object
    at org.bbop.apollo.Gff3HandlerService.encodeString(Gff3HandlerService.groovy:343)
    at org.bbop.apollo.Gff3HandlerService.extractAttributes(Gff3HandlerService.groovy:321)
    at org.bbop.apollo.Gff3HandlerService.convertToEntry(Gff3HandlerService.groovy:179)
    at org.bbop.apollo.Gff3HandlerService.convertToEntry(Gff3HandlerService.groovy:185)
    at org.bbop.apollo.Gff3HandlerService.convertToEntry(Gff3HandlerService.groovy:158)
    at org.bbop.apollo.Gff3HandlerService.writeFeature(Gff3HandlerService.groovy:114)
    at org.bbop.apollo.Gff3HandlerService.writeFeatures(Gff3HandlerService.groovy:80)
    at org.bbop.apollo.Gff3HandlerService.writeFeaturesToText(Gff3HandlerService.groovy:60)
    at org.bbop.apollo.SequenceController.$tt__exportSequences(SequenceController.groovy:116)
    at grails.plugin.cache.web.filter.PageFragmentCachingFilter.doFilter(PageFragmentCachingFilter.java:198)
    at grails.plugin.cache.web.filter.AbstractFilter.doFilter(AbstractFilter.java:63)

deepakunni3 commented 9 years ago

@nathandunn Yes, even I noticed the slow download. Perhaps we can review the code tomorrow and see if it can be sped up.

nathandunn commented 9 years ago

Yeah, I think that is a good idea.

nathandunn commented 9 years ago

Takes about 1 minute to export 9 annotations . . . I made a small change, but the real culprit is simply doing a recursive SQL query:

convertToEntry(...) {
    // ...
    gffEntries.add(entry);
    // Each getChildren() call is its own query, and the recursion repeats it for every descendant.
    for (Feature child : featureRelationshipService.getChildren(feature)) {
        if (child instanceof CDS) {
            convertToEntry(writeObject, (CDS) child, source, gffEntries);
        } else {
            convertToEntry(writeObject, child, source, gffEntries);
        }
    }
    // ...
}

nathandunn commented 9 years ago

This is partly because we look for children when the feature type makes no sense (e.g., CDS/Exon, etc.), which is a lot of our features. Should only look in the cases where we have a code-able transcript or gene.
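
A minimal Java sketch of that guard. The type names and the `ChildLookup` interface are hypothetical stand-ins, not Apollo's actual classes; the only point is that leaf types return early without touching the database.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: only feature types that can have children trigger a child lookup.
final class ChildLookupGuard {

    // Assumed set of leaf types; the real list would come from the feature model.
    static final Set<String> LEAF_TYPES =
            new HashSet<>(Arrays.asList("CDS", "exon", "splice_site"));

    // Stand-in for the service call that issues one query per feature.
    interface ChildLookup {
        List<String> childrenOf(String featureId);
    }

    static boolean mayHaveChildren(String featureType) {
        return !LEAF_TYPES.contains(featureType);
    }

    // Returns the children, or an empty list without querying when the type is a leaf.
    static List<String> childrenIfAny(String featureId, String featureType, ChildLookup lookup) {
        if (!mayHaveChildren(featureType)) {
            return new ArrayList<>();
        }
        return lookup.childrenOf(featureId);
    }
}
```

Since exons and CDS segments make up most of the rows in a typical annotation set, skipping the lookup for them should remove most of the per-feature queries, though the remaining ones are still issued one at a time.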

nathandunn commented 9 years ago

This is going to need a 1-2 day refactor, with each query returning a HashMap<String,Feature> where the String is a lookup key:

fetch all top-level (no parent) properties with a key of self then first-children list (uniquename1::uniquename2::etc.) uniquename // gets genes and pseudogenes and other types
fetch all mid-level properties (have parents and children), key self, parents and then children // transcripts and some other types
fetch all bottom-level properties (no children), key is this-string,parent-string // exon, cds, splice sites, etc.
fetch all non-relationship types (not sure if exist, but might be a type like this) // key would just be self

so this is 4 queries X # sequences regardless of the # of annotations

Fetches should go into "view" objects instead of domain objects.

For each "parent" we build up a structure of children.

It could be that the keys are just List<> and we use a nice "uniquename" identifier to facilitate a quick and proper lookup based on the uniquename only.
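
One way to picture the lookup structure: fetch flat (parent, child) rows in bulk, then group them in memory so the export never goes back to the database per feature. A minimal Java sketch, assuming a hypothetical `Relationship` view row keyed by uniquename; none of these names come from Apollo.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class FeatureTreeIndex {

    // Hypothetical "view" row: one parent/child pair as returned by a bulk query.
    static final class Relationship {
        final String parentUniqueName;
        final String childUniqueName;

        Relationship(String parentUniqueName, String childUniqueName) {
            this.parentUniqueName = parentUniqueName;
            this.childUniqueName = childUniqueName;
        }
    }

    private final Map<String, List<String>> childrenByParent = new LinkedHashMap<>();

    // Groups all rows in one pass; no query is issued here.
    FeatureTreeIndex(List<Relationship> rows) {
        for (Relationship row : rows) {
            childrenByParent
                    .computeIfAbsent(row.parentUniqueName, key -> new ArrayList<>())
                    .add(row.childUniqueName);
        }
    }

    List<String> childrenOf(String uniqueName) {
        return childrenByParent.getOrDefault(uniqueName, Collections.emptyList());
    }

    // Depth-first walk: parent first, then its descendants, all from memory.
    List<String> subtree(String rootUniqueName) {
        List<String> ordered = new ArrayList<>();
        collect(rootUniqueName, ordered);
        return ordered;
    }

    private void collect(String uniqueName, List<String> out) {
        out.add(uniqueName);
        for (String child : childrenOf(uniqueName)) {
            collect(child, out);
        }
    }
}
```

With an index like this, the number of queries depends on how many bulk fetches are needed per sequence, not on how many annotations the sequence carries.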

nathandunn commented 9 years ago

After further testing, I don't think that this was as bad as I thought.

nathandunn commented 9 years ago

@deepakunni3 Just an FYI.

After further testing, I think the problem is that H2 (what the dev environment uses) is REALLY slow for this type of operation. However, it seems to work great against PostgreSQL.

nathandunn commented 9 years ago

Sorry, didn't mean to close. It can still be optimized.

cmdcolin commented 9 years ago

Just updating this again, but GFF3 output is still really slow.

Here are several runs on the Pythium ultimum data (341 features)

real    3m51.849s

real    4m28.702s

The procedure to calculate the phase at the gff3 service level is particularly expensive
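
For reference, GFF3 defines phase as the number of bases to skip at the start of a CDS segment to reach the next codon boundary, so it can be derived from segment lengths alone. The Java sketch below is not Apollo's implementation; it assumes the segment lengths are already in memory, listed in 5'-to-3' order, and that the first segment starts on a codon boundary.

```java
// Hypothetical sketch: compute GFF3 phases in a single pass over CDS segment lengths.
final class CdsPhase {

    // segmentLengths: lengths of the CDS segments in 5'-to-3' transcript order.
    static int[] phases(int[] segmentLengths) {
        int[] result = new int[segmentLengths.length];
        int phase = 0; // assumption: the coding sequence starts on a codon boundary
        for (int i = 0; i < segmentLengths.length; i++) {
            result[i] = phase;
            // Bases left over after whole codons determine how many bases
            // the next segment must skip to reach a codon start.
            phase = Math.floorMod(phase - segmentLengths[i], 3);
        }
        return result;
    }
}
```

For example, segments of length 10, 10 and 9 get phases 0, 2 and 1. Nothing in the loop needs a database lookup once the lengths are loaded.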

nathandunn commented 9 years ago

that's pretty slow and those should be easy fixes (all / every is slow)

cmdcolin commented 8 years ago

As mentioned at the meeting, this is still pretty slow. It's about twice as fast as in my last report thanks to an optimization to FeatureRelationshipService::getParentForFeature, so it now takes about 2 minutes instead of the old 4 minutes on the 341 genes in the Pythium ultimum sample data.

nathandunn commented 8 years ago

Some real quick notes:

Exporting 18 features takes 1.24 seconds, which is not bad. As @cmdcolin noted earlier, it does a lot of individual and unnecessary feature queries:

nathandunn commented 8 years ago

Should note . . these are select X from feature where feature.id = ?

nathandunn commented 8 years ago

For more features (1800) we have this:

nathandunn commented 8 years ago

Which of course gets exacerbated:

However, digging a bit deeper, looks like the slowdown is in individual queries . . e.g. getComments does a request on a single feature for a single feature property.

I think that the solution is for "extractAttributes" to pull out a Map of features to their comments and then read from that per feature, similar to what is done in TranscriptService. This way it's just a single query to populate "comments".
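
A minimal Java sketch of that idea, assuming the comments for all exported features have already been fetched as flat rows by one bulk query. `CommentRow` and `CommentIndex` are hypothetical names for illustration, not Apollo classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class CommentIndex {

    // Hypothetical view row: one comment attached to one feature.
    static final class CommentRow {
        final long featureId;
        final String text;

        CommentRow(long featureId, String text) {
            this.featureId = featureId;
            this.text = text;
        }
    }

    // Groups the rows of a single bulk query by feature id.
    static Map<Long, List<String>> byFeature(List<CommentRow> rows) {
        Map<Long, List<String>> index = new HashMap<>();
        for (CommentRow row : rows) {
            index.computeIfAbsent(row.featureId, key -> new ArrayList<>()).add(row.text);
        }
        return index;
    }

    // In-memory replacement for a per-feature comment query.
    static List<String> commentsFor(Map<Long, List<String>> index, long featureId) {
        return index.getOrDefault(featureId, Collections.emptyList());
    }
}
```

The same grouping would apply to any other per-feature property that is currently fetched one feature at a time.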

cmdcolin commented 8 years ago

This is a random note, but FASTA needs optimization too. It is slow for different reasons, if I recall: namely, reading sequences into and out of the database.

nathandunn commented 8 years ago

Thanks. Could you open a different issue for this with any details?

Nathan

On Feb 3, 2016, at 7:47 AM, Colin Diesh notifications@github.com wrote:

This is a random note but fasta needs optimization too. it is slow for different reasons if I recall, namely, reading sequences into and out of database


monicacecilia commented 8 years ago

:dancer: it is faster.
