FooSoft / zero-epwing

Sane data exporter for an insane dictionary format.
https://foosoft.net/projects/zero-epwing/
MIT License

When run on kenkyuusha certain headers are incomplete #5

Open rtega opened 6 years ago

rtega commented 6 years ago

The heading of たしなむ is exported as "heading": "嗜む", while it should be "たしなむ【嗜む】".

rtega commented 6 years ago

I added the following lines at line 143 of book.c:

`if (strstr(result, "嗜む")) { printf("boef: %s %i %i\n", result, position->page, position->offset); }`

which yields the following result:

boef: tashinamu <たしなむ【嗜む】> 30827 984
boef: たしなむ【嗜む】 <..> 138094 1506

boef: たしなむ【嗜む】 33548 130
boef: たしなむ【嗜む】 <..> 138094 1506

boef: 嗜む 38028 1326
boef: たしなむ【嗜む】 <..> 138094 1506

Basically, what's happening is that there are three headers in the dictionary which all refer to the same article. Only the last header is exported.

rtega commented 6 years ago

Basically, things go wrong in book_undupe(book). We need to be smarter about what we are removing.

rtega commented 6 years ago

I would propose keeping the heading with the largest content when removing duplicates in book_undupe(book). I don't understand your code at first glance. Could you have a look at it?

rtega commented 6 years ago

I replaced the undupe code with this quicksort and removeDuplicates. The resulting file is a bit smaller, but it seems to work as it should.

```c
void swap(Book_Entry* a, Book_Entry* b) {
    Book_Entry t = *a;
    *a = *b;
    *b = t;
}

int partition_entries(Book_Entry arr[], int low, int high) {
    Book_Entry* pivot = &arr[high]; // pivot
    int i = (low - 1);              // Index of smaller element

    for (int j = low; j <= high - 1; j++) {
        // If current element is smaller than or
        // equal to pivot
        if (arr[j].text.page < pivot->text.page) {
            i++; // increment index of smaller element
            swap(&arr[i], &arr[j]);
        }
        if (arr[j].text.page == pivot->text.page) {
            if (arr[j].text.offset < pivot->text.offset) {
                i++;
                swap(&arr[i], &arr[j]);
                if (arr[j].text.offset == pivot->text.offset) {
                    if (strlen(arr[j].heading.text) <= strlen(pivot->heading.text)) {
                        i++;
                        swap(&arr[i], &arr[j]);
                    }
                }
            }
        }
    }
    swap(&arr[i + 1], &arr[high]);
    return (i + 1);
}

/* The main function that implements QuickSort
   arr[] --> Array to be sorted,
   low   --> Starting index,
   high  --> Ending index */
void quickSort_entries(Book_Entry arr[], int low, int high) {
    if (low < high) {
        /* pi is partitioning index, arr[p] is now at right place */
        int pi = partition_entries(arr, low, high);

        // Separately sort elements before
        // partition and after partition
        quickSort_entries(arr, low, pi - 1);
        quickSort_entries(arr, pi + 1, high);
    }
}

int removeDuplicates_subbook(Book_Subbook* subbook) {
    int n = subbook->entry_count;
    Book_Entry* arr = subbook->entries;

    // Return, if array is empty
    // or contains a single element
    if (n == 0 || n == 1)
        return n;

    Book_Entry* temp = malloc(n * sizeof(Book_Entry));

    // Start traversing elements
    int j = 0;
    for (int i = 0; i < n - 1; i++) {
        // If current element is not equal
        // to next element then store that
        // current element
        if ((arr[i].text.page != arr[i + 1].text.page) || (arr[i].text.offset != arr[i + 1].text.offset))
            temp[j++] = arr[i];
    }

    // Store the last element as whether
    // it is unique or repeated, it hasn't
    // stored previously
    temp[j++] = arr[n - 1];

    // Modify original array
    for (int i = 0; i < j; i++)
        arr[i] = temp[i];

    subbook->entry_count = j;
    free(temp);
    return j;
}

static void subbook_undupe(Book_Subbook* subbook) {
    quickSort_entries(subbook->entries, 0, subbook->entry_count - 1);
    removeDuplicates_subbook(subbook);
}
```
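For comparison, the same sort-then-dedupe idea can be sketched with the standard library's `qsort` instead of a hand-written quicksort. This is only an illustration under assumptions: `Demo_Entry`, `Demo_Position`, `demo_entry_compare` and `demo_undupe` are hypothetical names, and the struct is a simplified stand-in, not zero-epwing's actual `Book_Entry`. The comparator orders entries by page, then offset, then longest heading first, so the first entry of each run is the one to keep.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-ins for the real Book_Entry layout. */
typedef struct {
    int page;
    int offset;
} Demo_Position;

typedef struct {
    const char*   heading;
    Demo_Position text;
} Demo_Entry;

/* Order by page, then offset, then longest heading first.
 * strlen counts bytes, not characters, which is an assumption
 * about what "largest content" should mean. */
static int demo_entry_compare(const void* lhs, const void* rhs) {
    const Demo_Entry* a = lhs;
    const Demo_Entry* b = rhs;

    if (a->text.page != b->text.page) {
        return a->text.page < b->text.page ? -1 : 1;
    }
    if (a->text.offset != b->text.offset) {
        return a->text.offset < b->text.offset ? -1 : 1;
    }

    size_t len_a = strlen(a->heading);
    size_t len_b = strlen(b->heading);
    if (len_a != len_b) {
        return len_a > len_b ? -1 : 1;
    }
    return 0;
}

/* Sort, then keep only the first entry of each (page, offset) run.
 * Returns the new entry count. */
static size_t demo_undupe(Demo_Entry* entries, size_t count) {
    assert(entries != NULL || count == 0);
    if (count < 2) {
        return count;
    }

    qsort(entries, count, sizeof(entries[0]), demo_entry_compare);

    size_t kept = 0; /* index of the last entry kept so far */
    for (size_t i = 1; i < count; ++i) {
        const int same = entries[i].text.page == entries[kept].text.page &&
                         entries[i].text.offset == entries[kept].text.offset;
        if (!same) {
            entries[++kept] = entries[i];
        }
    }
    return kept + 1;
}
```

Because the longest heading sorts first within each run, the dedupe pass needs no temporary buffer and no extra length check.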

rtega commented 6 years ago

It crashes on gakken though.

rtega commented 6 years ago

And doesn't work as it should. Working on an updated version.

FooSoft commented 6 years ago

I think the easiest fix is just to check lengths when looking for dupes. If there is a dupe with a longer header length, swap it with the current entry and delete the dupe. You shouldn't have to sort anything.
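The approach described above could look roughly like the following. It is a minimal sketch, not the code in book.c: `Sketch_Entry`, `sketch_same_target` and `sketch_undupe` are hypothetical names, the struct is a simplified stand-in for the real entry type, and a plain quadratic scan is assumed for finding dupes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified entry: a heading plus the text position
 * (page, offset) that the heading points to. */
typedef struct {
    const char* heading;
    int         page;
    int         offset;
} Sketch_Entry;

static bool sketch_same_target(const Sketch_Entry* a, const Sketch_Entry* b) {
    return a->page == b->page && a->offset == b->offset;
}

/* In-place, no sorting: for each entry, scan the rest for dupes.
 * When a dupe has a longer heading, it replaces the current entry;
 * the dupe slot is then deleted. Returns the new entry count. */
static size_t sketch_undupe(Sketch_Entry* entries, size_t count) {
    assert(entries != NULL || count == 0);

    for (size_t i = 0; i < count; ++i) {
        size_t j = i + 1;
        while (j < count) {
            if (!sketch_same_target(&entries[i], &entries[j])) {
                ++j;
                continue;
            }

            /* Dupe found: keep the longer heading in slot i. */
            if (strlen(entries[j].heading) > strlen(entries[i].heading)) {
                entries[i] = entries[j];
            }

            /* Delete slot j by moving the last entry into it. j is not
             * advanced, because the moved entry still needs checking. */
            entries[j] = entries[count - 1];
            --count;
        }
    }
    return count;
}
```

Deleting by moving the last entry into the freed slot avoids shifting the array but does not preserve the original entry order; if order matters, a `memmove` of the tail would be needed instead.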

That being said, I'm not sure you actually want to use headers for anything. All of that information can be found in the entry text, and you are going to have to parse all of that stuff out with regex anyway. Honestly, if anything, this made me wonder if I should even be exporting the headers out of zero-epwing as AFAIK they are just some weird artifact of the EPWING format.

rtega commented 6 years ago

For reference articles you don't have a header in the entry text itself:

"heading": "¶両三日 <りょう2【両】>",
"text": "・両三日 two or three days; a couple of days\n"

I guess you really want to keep the info in the heading in that case. Take the example of 普通高等学校:

"heading": "¶普通高等学校 <こうとうがっこう【高等学校】>",
"text": "普通高等学校 a general [an ordinary, an academic] high school.\nこうとうかん【高等官】 {{w_46695}}(k{{n_41528}}t{{n_41528}}kan)\n"

The heading is referring to 高等学校 while the text is referring to 高等官. You want to keep the info in the heading, I think.

Looking at your code to remove dupes, I don't see how you can get at the entry you are comparing from a Page pointer alone.