duzun / hQuery.php

An extremely fast web scraper that parses megabytes of invalid HTML in a blink of an eye. PHP5.3+, no dependencies.
https://duzun.me/playground/hquery
MIT License

It is skipping empty data with find_text() method. How to get empty data also? #48

JeevanSarikonda closed this issue 5 years ago

JeevanSarikonda commented 5 years ago

I am scraping data from a table that has 4 columns. When I use the `find_text()` method, it returns the data from the 4 columns as an array, but if any `<td>` is empty, that cell is skipped and the array has fewer items, which causes problems.

Please help me out.

Thanks.

duzun commented 5 years ago

An example of the table's HTML and some PHP code or selector you are using would be helpful to understand the context of the issue.

JeevanSarikonda commented 5 years ago

    <section class="data-section data-section-older section-closed ng-scope" ng-if="item.historyOlderStatus">
        <div class="header cselem" ng-click="headerToggle($event)" data-cs-type="click" data-cs-name="3b.payment_history.4">
            <div class="title">Older Derogatory Events<i class="icon-expand"><a href="javascript:;" aria-expanded="false"><svg role="img" aria-label="expand or collapse older derogatory events"><use xlink:href="#icon-collapse"></use></svg></a></i></div>
        </div>
        <div class="body" style="display: none;">
            <table class="body-content">
                <tbody><tr class="bureau-row">
                    <td class="col-labels"></td>
                    <th class="col-value"><span class="bureau-name">Equifax</span></th>
                    <th class="col-value"><span class="bureau-name">TransUnion</span></th>
                    <th class="col-value"><span class="bureau-name">Experian</span></th>
                </tr>
                <tr class="data-set">
                    <th scope="row" class="col-labels col-lbl-30dl"><div class="data-label">30 Days Late</div></th>
                    <td class="col-value"><div class="data-value ng-binding" ng-bind-html="item.efx.historyOlder[0].late30">12/2018, 10/2018</div></td>
                    <td class="col-value"><div class="data-value ng-binding" ng-bind-html="item.tu.historyOlder[0].late30">12/2018, 10/2018</div></td>
                    <td class="col-value"><div class="data-value ng-binding" ng-bind-html="item.exp.historyOlder[0].late30"></div></td>
                </tr>
                <tr class="data-set">
                    <th scope="row" class="col-labels col-lbl-60dl"><div class="data-label">60 Days Late</div></th>
                    <td class="col-value"><div class="data-value ng-binding" ng-bind-html="item.efx.historyOlder[0].late60">01/2019</div></td>
                    <td class="col-value"><div class="data-value ng-binding" ng-bind-html="item.tu.historyOlder[0].late60">01/2019</div></td>
                    <td class="col-value"><div class="data-value ng-binding" ng-bind-html="item.exp.historyOlder[0].late60"></div></td>
                </tr>
                <tr class="data-set">
                    <th scope="row" class="col-labels col-lbl-90dl"><div class="data-label">90 Days Late</div></th>
                    <td class="col-value"><div class="data-value ng-binding" ng-bind-html="item.efx.historyOlder[0].late90">03/2019, 02/2019</div></td>
                    <td class="col-value"><div class="data-value ng-binding" ng-bind-html="item.tu.historyOlder[0].late90">03/2019, 02/2019</div></td>
                    <td class="col-value"><div class="data-value ng-binding" ng-bind-html="item.exp.historyOlder[0].late90"></div></td>
                </tr>
            </tbody></table>
        </div>
    </section>

**And my PHP code is:**

    $particles = $doc->find_text('table .data-set');
    foreach ($particles as $item)
    {
        $item = preg_split( '/\r\n|\r|\n/', $item);
        $item = array_map('trim',$item);
        $item = array_values(array_filter($item));
        $nodes[] = $item;
    }
    print_r($nodes);

duzun commented 5 years ago

Try this:

    $particles = $doc->find('table .data-set');
    $nodes = [];
    foreach ($particles as $tr)
    {
        $item = [];
        foreach($tr->children() as $td) {
            $item[] = trim($td->text());
        }
        $nodes[] = $item;
    }
    print_r($nodes);

One issue in your approach is `array_filter($item)`, which filters out the empty cells: without a callback it removes every empty (falsy) value, so an empty `<td>` disappears from the array. The other issue is `$item = preg_split( '/\r\n|\r|\n/', $item);`, which relies on HTML formatting instead of semantics/structure to extract the data. What if there are no newlines in the HTML?
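
The two pitfalls can be reproduced without any scraping library. The following is only a minimal sketch in plain PHP: the `$cells`, `$pretty` and `$minified` values are made-up stand-ins for the text of one table row, not output produced by hQuery.

```php
<?php
// Hypothetical cell texts for one table row; the last cell is empty.
$cells = ['30 Days Late', ' 12/2018, 10/2018 ', '12/2018, 10/2018', ''];

// array_filter() without a callback drops every falsy value,
// so the empty string (and a cell containing just "0") would vanish.
$filtered = array_values(array_filter(array_map('trim', $cells)));

// Trimming without filtering keeps one entry per cell.
$kept = array_map('trim', $cells);

echo count($filtered), "\n"; // 3
echo count($kept), "\n";     // 4

// Splitting text on newlines only works while the markup is pretty-printed.
$pretty   = "30 Days Late\n12/2018, 10/2018\n12/2018, 10/2018";
$minified = "30 Days Late12/2018, 10/201812/2018, 10/2018";

$prettyParts   = preg_split('/\r\n|\r|\n/', $pretty);
$minifiedParts = preg_split('/\r\n|\r|\n/', $minified);

echo count($prettyParts), "\n";   // 3
echo count($minifiedParts), "\n"; // 1
```

With the filtered array the column positions shift as soon as one cell is empty, and with minified markup the whole row collapses into a single string, which is why iterating over the cell elements is the more robust choice.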

JeevanSarikonda commented 5 years ago

Thank you. I will try it and update you.

JeevanSarikonda commented 5 years ago

Working fine. Nice library, thank you @duzun.