codereverser / casparser

Parser for Consolidated Account Statements (CAS) generated from CAMS/Karvy/Kfintech
MIT License

Duplicated transaction #45

Closed 0xKD closed 2 years ago

0xKD commented 3 years ago

This is a bug in pdfminer/mupdf, but I thought it would be useful to document here, since the implications are fairly serious if you rely on the output of casparser.

If a scheme's transactions continue across a page boundary like this, the parser seems to count the transaction at the start of the second page as part of the previous page as well. For me, it counts the *** Stamp Duty *** transaction at the start of the second page twice: once as part of the previous page (page 4), and again where it actually first appears (page 5).

[Image: parsingbug]

My guess is that the mediabox of the page (used by pdfminer to determine page boundaries) is larger than necessary and extends into the second one.

codereverser commented 3 years ago

Can you please post the partial JSON for the scheme around 5 Jun? I haven't faced such an issue yet, but it could happen if the PDF page boundaries extend or overlap.

0xKD commented 3 years ago
{
    "scheme": "Quantum India ESG Equity Fund - Direct Plan Growth",
    "advisor": null,
    "rta_code": "123ESGPG",
    "type": "EQUITY",
    "rta": "KFINTECH",
    "isin": "INF082J01382",
    "amfi": "147372",
    "open": "0.000",
    "close": "8298.919",
    "close_calculated": "8298.919",
    "valuation":
    {
        "date": "2021-07-19",
        "value": "133280.64",
        "nav": "16.06"
    },
    "transactions":
    [
        {
            "date": "2021-06-05",
            "description": "Systematic Investment (1/932)",
            "amount": "65531.72",
            "units": "4198.060",
            "nav": "15.61",
            "balance": "4198.060",
            "type": "PURCHASE_SIP",
            "dividend_rate": null
        },
        {
            "date": "2021-06-05",
            "description": "*** Stamp Duty ***",
            "amount": "3.28",
            "units": null,
            "nav": null,
            "balance": "4198.060",
            "type": "STAMP_DUTY_TAX",
            "dividend_rate": null
        },
        {
            "date": "2021-06-05",
            "description": "*** Stamp Duty ***",
            "amount": "3.28",
            "units": null,
            "nav": null,
            "balance": "4198.060",
            "type": "STAMP_DUTY_TAX",
            "dividend_rate": null
        },
        {
            "date": "2021-07-05",
            "description": "Systematic Investment (2/932)",
            "amount": "65531.72",
            "units": "4100.859",
            "nav": "15.98",
            "balance": "8298.919",
            "type": "PURCHASE_SIP",
            "dividend_rate": null
        },
        {
            "date": "2021-07-05",
            "description": "*** Stamp Duty ***",
            "amount": "3.28",
            "units": null,
            "nav": null,
            "balance": "8298.919",
            "type": "STAMP_DUTY_TAX",
            "dividend_rate": null
        }
    ]
}
codereverser commented 3 years ago

Oh, this is indeed weird. Not sure how to resolve it. 😕

A simple duplicate check won't work, since a scheme can legitimately have multiple stamp duty transactions on the same day when there are multiple purchase or switch-in transactions.

0xKD commented 3 years ago

Yeah, doing it in casparser isn't reliable.
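
For anyone consuming the JSON downstream, a rough sanity check is still possible. The sketch below is not part of casparser and is only a heuristic: it flags dates on which a scheme has more stamp duty rows than purchase-like rows, which is the pattern in the output posted above. The function name `suspicious_stamp_duty_dates` is hypothetical, and treating every type that starts with `PURCHASE` or `SWITCH_IN` as purchase-like is an assumption (only `PURCHASE_SIP` and `STAMP_DUTY_TAX` appear in this thread).

```python
from collections import Counter

# Assumption: any transaction type starting with one of these prefixes
# attracts stamp duty. Only PURCHASE_SIP is confirmed by the JSON above.
PURCHASE_PREFIXES = ("PURCHASE", "SWITCH_IN")


def suspicious_stamp_duty_dates(scheme):
    """Return sorted dates where stamp duty rows outnumber purchase-like rows.

    `scheme` is one scheme dict shaped like the JSON posted in this thread.
    """
    purchases = Counter()
    stamp_duties = Counter()
    for txn in scheme["transactions"]:
        txn_type = txn.get("type") or ""
        if txn_type == "STAMP_DUTY_TAX":
            stamp_duties[txn["date"]] += 1
        elif txn_type.startswith(PURCHASE_PREFIXES):
            purchases[txn["date"]] += 1
    # A Counter returns 0 for dates that have no purchase-like rows.
    return sorted(d for d, n in stamp_duties.items() if n > purchases[d])


# Trimmed copy of the scheme above: only the fields the check reads.
scheme = {
    "transactions": [
        {"date": "2021-06-05", "type": "PURCHASE_SIP"},
        {"date": "2021-06-05", "type": "STAMP_DUTY_TAX"},
        {"date": "2021-06-05", "type": "STAMP_DUTY_TAX"},
        {"date": "2021-07-05", "type": "PURCHASE_SIP"},
        {"date": "2021-07-05", "type": "STAMP_DUTY_TAX"},
    ]
}
flagged = suspicious_stamp_duty_dates(scheme)
print(flagged)
```

For the scheme above this flags 2021-06-05 and leaves 2021-07-05 alone. It only marks dates for manual review against the PDF. It cannot tell which row is the duplicate, and it misses duplicates of any other transaction type.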