aimeos / ai-controller-jobs

Aimeos e-commerce job controllers
https://aimeos.org
GNU Lesser General Public License v3.0
28 stars 17 forks source link

add catalog sitemap job #12

Closed s-diez closed 5 years ago

s-diez commented 5 years ago

See https://github.com/aimeos/ai-controller-jobs/issues/11

coveralls commented 5 years ago

Coverage Status

Coverage decreased (-0.06%) to 95.268% when pulling 34dec176fc577378c88b312ed9f83e659168631d on s-diez:catalog-sitemap into 44d316eef4ec54b2ec4f261270fe7b79014ca962 on aimeos:master.

coveralls commented 5 years ago

Coverage Status

Coverage decreased (-0.06%) to 95.268% when pulling 34dec176fc577378c88b312ed9f83e659168631d on s-diez:catalog-sitemap into 44d316eef4ec54b2ec4f261270fe7b79014ca962 on aimeos:master.

coveralls commented 5 years ago

Coverage Status

Coverage increased (+0.3%) to 95.631% when pulling e398257396d3eab97cd99370a8516e5988839613 on s-diez:catalog-sitemap into 0aab7f217bf23af7ef557477de4ebc1f9096328f on aimeos:master.

s-diez commented 5 years ago

All files are based on the product sitemap job. I mainly changed config paths, description and the search to retrieve the items. I'm not sure if I should generate translations or documentation or how it is done.

While testing I did find some oddities with the logic which creates multiple files based on 'max-items' and 'max-query' in \Aimeos\Controller\Jobs\Product\Export\Sitemap\Standard.php::export(). The test exports 8 items with 5 in the first sitemap file, 3 in the second, as expected, but also another third empty sitemap file. When you change 'max-items' to 6, the first file still only contains 5 items. My class is based on this so the behaviour is the same.

aimeos commented 5 years ago

The last empty file is probably due to creating a content item if there are no results any more: https://github.com/aimeos/ai-controller-jobs/pull/12/files#diff-6e026c9e50bd1a0a33f93cc4651ef1dfR362

If 'max-items' is 6, the first file should contain 6 items and the second 2 (maybe with an empty third file too). Maybe this was still the content of the old file or the configuration is still cached somewhere?

s-diez commented 5 years ago

The last empty file is probably due to creating a content item if there are no results any more: https://github.com/aimeos/ai-controller-jobs/pull/12/files#diff-6e026c9e50bd1a0a33f93cc4651ef1dfR362

I'm not sure about the exact problem but if all items can be exported in one file no additional file is generated. The problem with the extra empty file seems to only occur if we actually need multiple files.

If 'max-items' is 6, the first file should contain 6 items and the second 2 (maybe with an empty third file too). Maybe this was still the content of the old file or the configuration is still cached somewhere?

Some quick debugging confirmed the expected values are used for $maxItems and $maxQuery.

aimeos commented 5 years ago

One thing that came into mind: Is it necessary to split that into two jobs or isn't it better to combine that into one? The later approach would have the advantage for the shop owner to require only one configuration. Are there any disadvantages?

s-diez commented 5 years ago

As long as it would be configurable, if I actually want the category URLs. For example if one dedicated TYPO3 page is used for each category maybe the sitemap of TYPO3 would already handle those URLs.

aimeos commented 5 years ago

I'm not sure about the exact problem but if all items can be exported in one file no additional file is generated. The problem with the extra empty file seems to only occur if we actually need multiple files.

Then, this would be probably the problem: https://github.com/aimeos/ai-controller-jobs/pull/12/files#diff-6e026c9e50bd1a0a33f93cc4651ef1dfR373

aimeos commented 5 years ago

As long as it would be configurable, if I actually want the category URLs. For example if one dedicated TYPO3 page is used for each category maybe the sitemap of TYPO3 would already handle those URLs.

Yeah, you are right, it must be configurable. Another advantage would be much less code. A disadvantage that the "product/export/sitemap" job will not only consist of products anymore but that is a minor issue I think

s-diez commented 5 years ago

Yes I'm not so happy about the pure duplication. In some cases it could be solved by an abstract base class if it should remain seperate.

aimeos commented 5 years ago

Yes I'm not so happy about the pure duplication. In some cases it could be solved by an abstract base class if it should remain seperate.

Abstract base classes have the disadvantage that you can't overwrite them and have to extend all child classes instead :-/

Are you willing to create a PR with the code merged ?

s-diez commented 5 years ago

Are you willing to create a PR with the code merged ?

I would extend export() to select and loop over the categories after all products are added through addItems(). I would check a configuration controller/jobs/product/export/sitemap/catalog. Creating the actual entries would happen in addCatalogItems() which would use controller/jobs/product/export/sitemap/standard/template-catalog-items and controller/jobs/product/export/sitemap/catalog-changefreq. I don't know if the naming of the config options is OK and if this leaves enough customization for any differences between product and catalog URLs. If this is fine for you I can create another pull request.

aimeos commented 5 years ago

Please use sitemap-items-catalog-standard.xml for the template file name to remain consistent to the naming. The rest seems to be OK

s-diez commented 5 years ago

Regarding the urls per file and number of files. Adding to the already discussed problems controller/jobs/product/export/sitemap/max-items can never be smaller then controller/jobs/product/export/sitemap/max-query as only after all queried items are written at once through addItems() a check is executed an possibly a new file is created.

aimeos commented 5 years ago

Makes sense. Is there an elegant solution to make it possible?

s-diez commented 5 years ago

Maybe I can take another look next week. At the moment I don't have the time but I would expect you would possibly need to slice the array of items from the DB and keep track of the current number of items written to the current file.

s-diez commented 5 years ago

Hi sorry for the long delays. Because of the errors with the loop and the rewrite to integrate the catalog sitemap instead of my initial solution it will take some more time until I have the free time to make clean version.

aimeos commented 5 years ago

@s-diez Do you have some time to finish the PR?

s-diez commented 5 years ago

@aimeos I'll try to continue with it this week or maybe at the weekend.

s-diez commented 5 years ago

So I did take another go at the problem of the amount of generated files and the amount of items in them.

I could not find any elegant solution. As long as max-query and max-items are completely independent of each other and the files should always be filled to the maximum possible the generation is quite complex.

The function below should work, but honestly I find it too complex. Personally I would prefer limiting max-items to some multiple of max-query.

/**
     * Exports the products into the given container
     *
     * @param \Aimeos\MW\Container\Iface $container Container object
     * @param boolean $default True to filter exported products by default criteria
     * @return array List of content (file) names
     */
    protected function export( \Aimeos\MW\Container\Iface $container, $default = true )
    {
        $domains = $this->getConfig( 'domains', [] );
        $maxItems = $this->getConfig( 'max-items', 10000 );
        $maxQuery = $this->getConfig( 'max-query', 1000 );

        $start = 0; $filenum = 1; $itemsInFile = 0; $currentSliceOffset = 0; $total = 0; $exported = 0;
        $names = [];

        $manager = \Aimeos\MShop::create( $this->getContext(), 'index' );

        $search = $manager->createSearch( $default );
        $search->setConditions( $search->compare( '!=', 'index.catalog.id', null ) );
        $search->setSortations( array( $search->sort( '+', 'product.id' ) ) );
        $search->setSlice( 0, $maxQuery );

        $items = $manager->searchItems( $search, $domains, $total);
        if ($total === 0) {
            return $names;
        }
        $content = $this->createContent( $container, $filenum );
        $names[] = $content->getResource();

        $count = count($items);
        $start += $count;
        do {
            // no remaining items for current file
            if ($currentSliceOffset >= $count) {
                $search->setSlice( $start, $maxQuery );
                $items = $manager->searchItems( $search, $domains );
                $count = count($items);
                $start += $count;
                $currentSliceOffset = 0;
            }
            // create an array of as many items as the current file has remaining space
            if ($count > ($maxItems - $itemsInFile)) {
                $itemsForFile = array_slice($items, $currentSliceOffset, $maxItems - $itemsInFile, true);
                $currentSliceOffset += $maxItems - $itemsInFile;
            } else {
                $itemsForFile = $items;
                $currentSliceOffset = $count;
            }
            $this->addItems($content,$itemsForFile);
            $addedCount = count($itemsForFile);
            $itemsInFile += $addedCount;
            $exported += $addedCount;
            // there is still room => continue select items from db
            if ($itemsInFile < $maxItems) {
                continue;
            }
            // still items to export => close and open new file
            if ($itemsInFile >= $maxItems && $exported < $total) {
                $this->closeContent( $content );
                $content = $this->createContent( $container, ++$filenum );
                $names[] = $content->getResource();
                $itemsInFile = 0;
            }
        } while ($currentSliceOffset < $count || $start < $total); // still items to export

        $this->closeContent( $content );

        return $names;
    }
s-diez commented 5 years ago

Concerning the integration of the catalog urls into the product sitemap. I did a test of it in https://github.com/s-diez/ai-controller-jobs/commit/c290ae10a16380a84d7383113b04fd962a55470c. It would need some more corrections to comments and configuration documentation. But still I'm not sure, if I really like the general idea of integrating it into the product sitemap. Or did you have something else in mind on how to integrate it?

aimeos commented 5 years ago

You are right, the code is now more complex that it should be. If max-items is a multiple of max-query, that would be OK too.

aimeos commented 5 years ago

Regarding catalog sitemap, this should be definitively implemented in an own job controller so users have the option to generated it if needed or not. The name of the sitemap index must be added by hand to the robots.txt nevertheless.

s-diez commented 5 years ago

Regarding catalog sitemap, this should be definitively implemented in an own job controller so users have the option to generated it if needed or not. The name of the sitemap index must be added by hand to the robots.txt nevertheless.

I'm sorry but now I'm confused. Isn't this the way my initial implementation as visible in the pull-request is done? Now I don't really know what to actually implement or how to change my pull-request.

aimeos commented 5 years ago

Sorry, you are right it's already the case. If you want to merge it to the master branch, you have to adapt it a bit to the current code (mainly the factory method name). Also, the templates must end with ".php" and that extension must not be part of the render() call in the jobs class.

aimeos commented 5 years ago

There's also one issue in the code check (unused variable): https://scrutinizer-ci.com/g/aimeos/ai-controller-jobs/inspections/15ec0654-0c38-4f17-9c99-6e9972c29956/issues/files/controller/jobs/tests/Controller/Jobs/Catalog/Export/Sitemap/StandardTest.php?status=new&orderField=path&order=asc&honorSelectedPaths=0

s-diez commented 5 years ago

I did update my original Implementation to the newest Version and fixed the Tests.

I did use @since 2019.2 for all of my config options as I did not know in what version it will be introduced.

The Problem with the algorithm used to create the individual sitemap files should probably be discussed in a new merge requests, as it is used in many jobs. If you want I can try to create a simpler Version based on a Relationship between max-query and max-items. But it may take a bit till I find the time. For now my job suffers the same problem as all the others.

aimeos commented 5 years ago

Great! :-) We will open a new issue for the slicing problem

aimeos commented 5 years ago

Here's the new issue: https://github.com/aimeos/ai-controller-jobs/issues/13