alan-turing-institute / CleverCSV

CleverCSV is a Python package for handling messy CSV files. It provides a drop-in replacement for the builtin CSV module with improved dialect detection, and comes with a handy command line application for working with CSV files.
https://clevercsv.readthedocs.io
MIT License

Handling text fragments in the first few rows of a CSV file #29

Open RahulSinghYYC opened 4 years ago

RahulSinghYYC commented 4 years ago

Hi, I am trying to solve an issue with CSV files that have explanations and text fragments in the first few rows, before the table. How can I handle this using CleverCSV?

Any help would be greatly appreciated.

Thanks

GjjvdBurg commented 4 years ago

Hi @RahulSinghYYC, thanks for your question. This depends a bit on whether you're reading the file as a list of lists or as a dataframe. If you're using the read_table function, you can simply remove the unnecessary rows afterwards. If you're using the read_dataframe function, you can pass the skiprows argument to read_dataframe, as this will be passed on to pandas.read_csv.

Hope this helps, please let me know whether this answers your question.

RahulSinghYYC commented 4 years ago

Hi @GjjvdBurg, thank you for your reply, but we receive CSV files with a varying number of lines of text fragments, so there is no way of telling in advance how many rows to skip. I am hoping to implement something that detects this automatically. Any help is greatly appreciated. Thanks

GjjvdBurg commented 4 years ago

Hi @RahulSinghYYC, CleverCSV doesn't currently have support for detecting the table area automatically. I know there is some research on this problem (see, e.g. hypoparsr and Pytheas), but there are no ready-to-use Python solutions that I'm aware of.

This is certainly a feature that would not be out of place in CleverCSV, so I suggest we leave this issue open so I/we can hopefully come back to it in the future. Thanks for letting me know there's interest in this :)

lcnittl commented 4 years ago

@RahulSinghYYC For a preliminary workaround, you might want to consider reading the file and trying to determine the first line of the table yourself. This could, e.g., be a search for multiple occurrences of the column delimiter (I could envision using a regex here), or looking for values that are certainly part of the table (like header names) to find the first line number of the data. This number can then be used with the methods @GjjvdBurg suggested. But of course, all of this highly depends on your CSV data.
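A minimal sketch of the delimiter-counting idea; the function name, delimiter, and threshold are all just placeholders you would tune for your data:

```python
import re

def find_table_start(lines, delimiter=";", min_count=2):
    # Return the index of the first line containing at least
    # `min_count` occurrences of the delimiter; fall back to 0
    # if no such line is found. Purely heuristic.
    pattern = re.compile(re.escape(delimiter))
    for i, line in enumerate(lines):
        if len(pattern.findall(line)) >= min_count:
            return i
    return 0

lines = [
    "Some free-text explanation,",
    "spread over a few lines.",
    "name;age;city",
    "Alice;30;Leeds",
]
skip = find_table_start(lines)  # 2, the index of the header row
```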

Hope I could help!

GjjvdBurg commented 4 years ago

Thanks for offering a suggestion @lcnittl, very nice of you to help! :+1:

Just to offer another work-around: one of the main approaches that CleverCSV takes in detecting the dialect is looking at what we call "patterns" of row lengths. You can probably assume that the text before (and after) the table is not formatted with the same number of delimiters as the table itself. If this is the case, then you can use the row patterns to guess at the start and end of the table.

Take, for example, this dataset of CO2 measurements, which has text before and after the table. We could extract the row patterns as follows:

>>> from clevercsv import Sniffer
>>> from clevercsv.detect_pattern import make_abstraction
>>> from urllib.request import urlopen
>>>
>>> # get the data
>>> req = urlopen('https://cdiac.ess-dive.lbl.gov/ftp/trends/co2/maunaloa.co2')
>>> data = req.read().decode('utf-8')  # sniff expects a string, not bytes
>>>
>>> # detect the dialect
>>> dialect = Sniffer().sniff(data)
>>> 
>>> # Compute the row patterns
>>> A = make_abstraction(data, dialect)
>>> row_patterns = A.split('R')

This gives (abbreviating) row_patterns = ['C'] * 13 + ['CDCDCDCDCDCDCDCDCDCDCDCDCDC'] * 53 + ['C'] * 3, so we can be quite sure that we can skip the first 13 rows and remove the last 3. Note that it's not in general guaranteed that the longest/most frequent row pattern is that of the table, but this may be the case in your scenario.
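Building on that, one could sketch a small helper that takes the contiguous run of the most frequent row pattern as the table area; this is only a heuristic, and the function name and logic below are illustrative, not part of CleverCSV:

```python
from collections import Counter

def guess_table_bounds(row_patterns):
    # Find the most frequent pattern, then the first contiguous
    # run of it; return (inclusive) start and end row indices.
    # Assumes the table's rows dominate the file.
    most_common, _ = Counter(row_patterns).most_common(1)[0]
    start = row_patterns.index(most_common)
    end = start
    while end + 1 < len(row_patterns) and row_patterns[end + 1] == most_common:
        end += 1
    return start, end

# Using the (abbreviated) patterns from the CO2 example above:
patterns = ["C"] * 13 + ["CDCDCD"] * 53 + ["C"] * 3
start, end = guess_table_bounds(patterns)  # rows 13 through 65
```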

RahulSinghYYC commented 4 years ago

Thank you @GjjvdBurg and @lcnittl for your advice and recommendations. I have been banging my head against this issue for quite a long time with no success, and it's a big challenge for us, as we get lots of CSV files with text fragments at the top and bottom from various systems whose output format we have no control over.

I will try your recommendation, @GjjvdBurg, and see how far I can get.

Thanks