NTMC-Community / MatchZoo

Facilitating the design, comparison and sharing of deep text matching models.
Apache License 2.0

Is there any method to change csv to data_pack? #835

Closed saekomdalkom closed 3 years ago

saekomdalkom commented 3 years ago

Hello, this might be a silly question,
but it seems that I have to create a new data_pack to use my own data with MatchZoo.
Is there a method that converts a CSV file to a data_pack automatically?
I guess I need to build it myself. Is that right?

bwanglzu commented 3 years ago

Hi @saekomdalkom, no worries. Please take a look at the example code here:

        >>> import pandas as pd
        >>> from matchzoo import DataPack
        >>> left = [
        ...     ['qid1', 'query 1'],
        ...     ['qid2', 'query 2']
        ... ]
        >>> right = [
        ...     ['did1', 'document 1'],
        ...     ['did2', 'document 2']
        ... ]
        >>> relation = [['qid1', 'did1', 1], ['qid2', 'did2', 1]]
        >>> relation_df = pd.DataFrame(relation)
        >>> left = pd.DataFrame(left)
        >>> right = pd.DataFrame(right)
        >>> dp = DataPack(
        ...     relation=relation_df,
        ...     left=left,
        ...     right=right,
        ... )
        >>> len(dp)
        2

As you can see, a DataPack is a wrapper over three pandas DataFrames: a query frame (query ID and query text), a document frame (document ID and document text), and a relation frame (query ID, document ID, and label). You can build the three DataFrames with the pandas interface and pass them to DataPack.
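For the CSV case specifically, here is a minimal sketch of how the rows of a CSV file could be split into those three tables using only the standard library. The CSV layout (one query/document pair per row, with `qid`, `query`, `did`, `document`, and `label` columns), the `SAMPLE_CSV` data, and the `split_csv` helper are all assumptions made for illustration; none of them is part of MatchZoo.

```python
import csv
import io

# Hypothetical CSV layout (an assumption, not something MatchZoo prescribes):
# one query/document pair per row, preceded by a header line.
SAMPLE_CSV = """qid,query,did,document,label
qid1,query 1,did1,document 1,1
qid1,query 1,did2,document 2,0
qid2,query 2,did2,document 2,1
"""


def split_csv(handle):
    """Split pair-per-row CSV data into left, right, and relation row lists.

    Illustrative helper only; it is not a MatchZoo function.
    """
    left, right, relation = {}, {}, []
    for row in csv.DictReader(handle):
        # Keep each query and each document once, keyed by its ID.
        left.setdefault(row["qid"], row["query"])
        right.setdefault(row["did"], row["document"])
        # Every CSV row contributes one (query ID, document ID, label) relation.
        relation.append([row["qid"], row["did"], int(row["label"])])
    left_rows = [[qid, text] for qid, text in left.items()]
    right_rows = [[did, text] for did, text in right.items()]
    return left_rows, right_rows, relation


left, right, relation = split_csv(io.StringIO(SAMPLE_CSV))
print(left)
print(right)
print(relation)
```

The three lists have the same shape as `left`, `right`, and `relation` in the example above, so they can be wrapped in `pd.DataFrame` and handed to `DataPack` in the same way. For a real file, pass an opened file object, for example `open(path, newline="")`, instead of the `io.StringIO` wrapper.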

saekomdalkom commented 3 years ago

Thank you :)