@ThomasdenH I would appreciate it if you could have a look at #81 and see whether that approach works for your CSV importers. It should make it easy to cover all the use cases the current CSV importer covers, and many more. For an example, see the newcsvbank.py importer in the examples. In particular, the new CSV importer base class has a skiplines class variable to define how many lines to skip at the beginning of the file, and a names class variable to define whether the first row contains column names or not.
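For illustration, a minimal sketch of what such an importer can look like (the column names, date format, and file layout are assumptions for the example, not taken from newcsvbank.py):

from beangulp import csvbase

class MyBankImporter(csvbase.Importer):
    # Skip a two-line preamble before the data starts (hypothetical layout).
    skiplines = 2
    # The first remaining row contains the column names.
    names = True
    date = csvbase.Date('Date', '%Y-%m-%d')
    narration = csvbase.Column('Description')
    amount = csvbase.Amount('Amount')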
Thanks! The new importer fits my needs well.
Thank you for having a look. I didn't mean it to be a replacement for fixing the existing importer, and thus for this PR. However, I tend toward making csvbase.Importer the recommended solution, deprecating csv.Importer, and removing it after the first release of beangulp. Do you have any comments on the extension interface of csvbase.Importer? I would appreciate some feedback before I merge it into beangulp.
finalize accepted None as a return type to indicate that a transaction should be removed.
I couldn't really figure out how to use the filing and identify mixins, but that is probably due to my limited experience with Python.
This is by design. The mixins are considered an awkward API and are going to be phased out with the switch to the new Importer interface. I find it much easier to implement identify() and the archiving related methods directly in the importer. If your experience is different, I would like to hear about it.
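For illustration, implementing these directly can look roughly like this (a sketch; treating filename() as the archiving hook is an assumption about the new interface, not something stated here):

import mimetypes

from beangulp import csvbase

class Importer(csvbase.Importer):
    def identify(self, filepath):
        # Accept files whose guessed MIME type is CSV.
        mimetype, encoding = mimetypes.guess_type(filepath)
        return mimetype == 'text/csv'

    def filename(self, filepath):
        # Archiving hook: the name under which the document is filed.
        return 'csvbank.csv'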
I use one CSV format where amounts use a comma as a decimal separator. It was very easy to add a column type to fix that, but it could also be a simple addition to the Amount type.
I thought about it, but I failed to come up with a general API that does not involve passing a parse function to the Column definition, and if you do that, you may as well define a new Column type:
class Amount(csvbase.Amount):
    def parse(self, value):
        # Normalize the decimal separator before the base class parses the value.
        return super().parse(value.replace(',', '.'))
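(Such a column type would then be assigned in the importer class body, e.g. amount = Amount('Amount'), assuming a column named 'Amount'.)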
Do you have an idea for a suitable API?
The old importer filtered out zero-value transactions (I think). This one doesn't, which is probably better, but I would still like some way to filter them out.
I think the old CSV importer did some filtering, indeed, but why would transactions with a zero amount be in the bank statement (or other source) if they do not contain any interesting information? I thus consider the filtering an anti-feature. I think that filtering is more clearly implemented as a second step:
from beancount.core import data
from beancount.core.number import ZERO

class Importer(csvbase.Importer):
    def extract(self, filepath, existing):
        entries = []
        for entry in super().extract(filepath, existing):
            # Drop transactions whose first posting has a zero amount.
            if isinstance(entry, data.Transaction) and entry.postings[0].units.number == ZERO:
                continue
            entries.append(entry)
        return entries
but I may be convinced otherwise.
I find it much easier to implement identify() and the archiving related methods directly in the importer. If your experience is different, I would like to hear about it.
Maybe add a convenience function somewhere to make checking the MIME type doable in one line? Other than that, overriding the methods was indeed easier.
Do you have an idea for a suitable API?
Your example was pretty much how I handled it. My specific case could be solved with a decimal_separator=',' argument in the csvbase.Amount constructor, but I am not sure if that's worth it.
I think that filtering is more clearly implemented as a second step:
In my case this is a good solution. In the general case there may be situations where the transactions for some rows can't even be built. Is that even possible, for example if some attributes are missing for a row? In that case it may be necessary to filter rows before building the transactions.
Maybe add a convenience function somewhere to make checking the MIME type doable in one line?
Currently it is three lines:
mimetype, encoding = mimetypes.guess_type(filepath)
if mimetype == 'text/csv':
    return True
which, losing some readability, can be reduced to
return mimetypes.guess_type(filepath)[0] == 'text/csv'
It would be nicer if guess_type() returned a named tuple, but unfortunately it does not.
It could be simplified to something like
return mimetypes.check_type(filepath, 'text/csv')
but I don't know if it adds enough convenience to make it worth remembering a new API.
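As a sketch, such a helper (hypothetical; check_type() is not part of the mimetypes module) could simply wrap guess_type():

import mimetypes

def check_type(filepath, expected):
    # True when the guessed MIME type matches the expected one.
    mimetype, _ = mimetypes.guess_type(filepath)
    return mimetype == expected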
In my case this is a good solution. In the general case there may be situations where the transactions for some rows can't even be built. Is that even possible, for example if some attributes are missing for a row? In that case it may be necessary to filter rows before building the transactions.
In this case my advice is to override the extract() method entirely. It is pretty straightforward. Alternatively, the read() method can be overridden:
class Importer(csvbase.Importer):
    def read(self, filepath):
        for row in super().read(filepath):
            # Skip rows with an empty amount field.
            if not row.amount:
                continue
            yield row
This has the drawback that you lose the correct line number information in the transaction metadata, but in most applications it does not matter at all (it is not serialized by default and is there only to provide the correct ordering). I think that in the vast majority of cases a missing field should result in an error, and adding an option to skip rows with missing values would require a way to specify which values are optional, which seems a lot of work for very little gain. The API has been designed to be extensible more than configurable, and I think it works rather well.
Your example was pretty much how I handled it. My specific case could be solved with a decimal_separator=',' argument in the csvbase.Amount constructor, but I am not sure if that's worth it.
I added a subs argument to the csvbase.Amount constructor. It takes a dictionary mapping regular expression patterns to replacement strings. The replacements are applied with re.sub(), in the order provided, before the string is passed to decimal.Decimal(). I figure that this is the simplest and most generic way to handle it. For example, it allows handling amounts with currency prefixes: csvbase.Amount(0, subs={'^\\$(.*)': '\\1', ',': ''}) parses strings like $1,000.00.
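For the decimal-comma case discussed earlier, the same mechanism should work along these lines (a sketch; the exact patterns depend on the input format, and the column name is an assumption):

from beangulp import csvbase

class Importer(csvbase.Importer):
    # Strip thousands dots, then turn the decimal comma into a period,
    # e.g. '1.234,56' becomes '1234.56' before decimal.Decimal() parsing.
    amount = csvbase.Amount('Amount', subs={'\\.': '', ',': '.'})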
Thank you very much for your feedback.
Great! Thank you for the work on this project!
If a CSV file without a header contains only one entry, the current importer would interpret that entry as a header. This PR adds a setting to bypass the automatic header detection.
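With the csvbase importer discussed above, the same can be expressed by declaring that the file has no header row; a sketch (the column positions and date format are illustrative):

from beangulp import csvbase

class Importer(csvbase.Importer):
    # No header row: columns are addressed by position, so a single-row
    # file is never mistaken for a header.
    names = False
    date = csvbase.Date(0, '%Y-%m-%d')
    narration = csvbase.Column(1)
    amount = csvbase.Amount(2)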