CGnal / cgnal-core

A Python library of data structures optimized for machine learning tasks
MIT License
8 stars 2 forks source link

MongoArchiver issues #6

Open NicolaDonelli opened 3 years ago

NicolaDonelli commented 3 years ago

There are a few possible issues with cgnal.data.layer.mongo.archivers.MongoArchiver class.

  1. the signature of "retrieve" method is not compatible with the one of Archiver, since Archiver.retrieve requires *args and **kwargs. I'd propose to change the signature of the ABC making it compatible with our requirements for Pandas and Mongo Archivers
  2. to make archiveOne and archiveMany signature coherent, so that archive could have a simpler output type, wouldn't it be better for this method to return self.archiveMany([obj]) (like we did with PandasArchiver)?
  3. 'archive' method's output type is not consistent with its' ancestor's return type (that should be 'MongoArchiver'). I'd propose to fix this returning self instead of those UpdateResults

@deusebio @memoclaudio @aiola

deusebio commented 3 years ago

Being the "archive" a side-effect probably makes also sense to return None (and this would also give the same output between archiveOne and archiveMany, which should always be None). Another possibility (but I would do it later in the future when we integrate Mondas nicely in the package) would be to return an Either, in order to also handle errors functionally.

deusebio commented 3 years ago

about the arguments of retrieve it is actually quite a pain. Unfortunately we don't have an abstraction for "Query" yet, that is an object that represents the filters one wants to apply when querying persistence layers that is agnostic from the actual implementation (SQL wants strings, Pandas we could use functions, MongoDB we use dicts). I agree, this is really bothering me, but unfortunately, I don't quite have an elegant solution for handling this.

What we started to do in CIB News is to extend from the Archiver, implement methods that do the querying and call the retrieve with the generic output, e.g.

class DocumentsArchiver(Archiver):
    @abstractmethod
    def byPublishDate(publishDate: str): ...

class MongoDocumentsArchiver(MongoArchiver, DocumentsArchiver):

    def  byPublishDate(publishDate: str):
        return self.retrieve({"publishDate": publishDate})

class PandasDocumentsArchiver(PandasArchiver, DocumentsArchiver):

    def  byPublishDate(publishDate: str):
        return self.retrieve(lambda df: df[df["publishDate"] == publishDate])

In the code, we never use retrieve directly, but we generally use the custom retrieving methods, such as byPublishDate

But maybe, here, @nromeoarena can advise what it is best to do

memoclaudio commented 3 years ago

In my opinion, we should avoid function returning Node. For functions like archive, I do agree with @deusebio that Either could be the best solution. It will at least return the status of the operation in order to track errors.

memoclaudio commented 3 years ago

As an input object for the archive, I also agree with @deusebio and we should think more about a "Query" abstract object that will implement query in a specific language like QueryMongo, QueryPandas, QuerySQL. We can then create a factory that actually builds specific queries instead of putting the logic of a query inside the Archiver. As example

class ClientQueryBuilder(MongoQueryBuilder): def retrieveByDate(data :str) -> MongoQuery: def retrieveClientByRegion(region: str) -> MongoQuery:

the archiver will then contain only the archive method def archive(query: Query) -> Either and def retrieve(query: Query) -> Either, according to this refactoring the archiver will only contain the "connection to the database" logic.

Moreover if we centralize the querybulder/queryfactory we have a centralized point where we actually build the query and we do not need to write the same query in different part of the code without touching the archiver that will be finally independent form business logic.