PHPOffice / PHPExcel

ARCHIVED
Other
11.47k stars 4.19k forks source link

HTML file validation fails on valid page #225

Open mzeddd opened 11 years ago

mzeddd commented 11 years ago

Hi,

I'm trying to fetch data from HTML page and read it as an Excel table

This is what I have (code from different examples combined together)


$url = "http://paraplan.ru/forum/jonatan.php";
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL,$url); // set url to post to
curl_setopt($ch, CURLOPT_FAILONERROR, 1);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);// allow redirects
curl_setopt($ch, CURLOPT_RETURNTRANSFER,1); // return into a variable
curl_setopt($ch, CURLOPT_HEADER, 0); curl_setopt($ch, CURLOPT_TIMEOUT, 3); // times out after 4s
$result = curl_exec($ch); // run the whole process curl_close($ch);

$outputFile = 'var/counter.html';

$fp = fopen($outputFile, 'w'); $test = fwrite($fp, $result); fclose($fp);

$objReader = PHPExcel_IOFactory::createReader('HTML'); $objPHPExcel = $objReader->load($outputFile);


With this code PHPExcel stops loading file because it is not valid HTML file.

I think you should check validation function. Possibly it is not working due to Russian content of the page (Cyrillic).

Currently in my code I disabled it to proceed with my experiments.

MarkBaker commented 11 years ago

I've just downloaded the file manually, read as HTML, and written using Excel2007, and it gives me exactly what I'd expect (including the Cyrillic characters) without any problems:

$inputFileType = 'HTML';
$inputFileName = './Cyrillic_UTF-8.htm';
$outputFileType = 'Excel2007';
$outputFileName = './Cyrillic_UTF-8.xlsx';

echo date('H:i:s')." Load Tests from $inputFileType file" , PHP_EOL;
$callStartTime = microtime(true);

try {
    $objPHPExcelReader = PHPExcel_IOFactory::createReader($inputFileType);
    $objPHPExcelReader -> setReadDataOnly(false);
    $objPHPExcel = $objPHPExcelReader->load($inputFileName);
} catch (Exception $e) {
   die ('ERROR: ' . $e->getMessage());
}

$callEndTime = microtime(true);
$callTime = $callEndTime - $callStartTime;
echo 'Call time to read Workbook was '.sprintf('%.4f',$callTime)." seconds" , PHP_EOL;
// Echo memory usage
echo date('H:i:s').' Current memory usage: '.(memory_get_usage(true) / 1024 / 1024)." MB" , PHP_EOL , PHP_EOL;

echo date('H:i:s')." Save as $outputFileType file" , PHP_EOL;
$callStartTime = microtime(true);

$objPHPExcelWriter = PHPExcel_IOFactory::createWriter($objPHPExcel,$outputFileType);
$objPHPExcel = $objPHPExcelWriter->save($outputFileName);

$callEndTime = microtime(true);
$callTime = $callEndTime - $callStartTime;
echo 'Call time to write Workbook was '.sprintf('%.4f',$callTime)." seconds" , PHP_EOL;
// Echo memory usage
echo date('H:i:s').' Current memory usage: '.(memory_get_usage(true) / 1024 / 1024)." MB" , PHP_EOL , PHP_EOL;

echo date('H:i:s').' Peak memory usage: '.(memory_get_peak_usage(true) / 1024 / 1024)." MB" , PHP_EOL , PHP_EOL;
mzeddd commented 11 years ago

Can you try to save page using CURL as I did?

MarkBaker commented 11 years ago

No problems using curl to retrieve the file either. Are you using the latest develop branch?

omicode commented 8 years ago

I try your method to save html table with cyrillic to xsl, but cyrillic word is broken.

    $table = $this->displayTable($rows);
    $tmpfile = tempnam(sys_get_temp_dir(), "html");
    file_put_contents($tmpfile, $table);

try { $excelHTMLReader = PHPExcel_IOFactory::createReader("HTML"); // Set active sheet index to the first sheet, so Excel opens this as the first sheet $excelHTMLReader->setReadDataOnly(false); $objPHPExcel = $excelHTMLReader->load($tmpfile); } catch (Exception $e) { die ('ERROR: ' . $e->getMessage()); }

    $objPHPExcel->getProperties()->setCreator("BPage 2.0")
                                 ->setLastModifiedBy("BPage 2.0")
                                 ->setTitle($title);

    // Rename worksheet
    $objPHPExcel->getActiveSheet()->setTitle($title);

    unlink($tmpfile);

    if ($outputType == "Excel5") {
        // Redirect output to a client’s web browser (Excel5)
        header("Content-Type: application/vnd.ms-excel");
        header("Content-Disposition: attachment;filename=\"rep-".today(DATE_BG).".xls\"");
    } else {
        // Redirect output to a client’s web browser (Excel2007)
        header("Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet; charset=UTF-8;"); // charset=UTF-8
        header("Content-Disposition: attachment;filename=\"rep-".today(DATE_BG).".xlsx\"");
    }
    header('Cache-Control: max-age=0');
    header('Cache-Control: max-age=1');
    header('Expires: Mon, 26 Jul 1997 05:00:00 GMT'); // Date in the past
    header('Last-Modified: '.gmdate('D, d M Y H:i:s').' GMT'); // always modified
    header('Cache-Control: cache, must-revalidate'); // HTTP/1.1
    header('Pragma: public'); // HTTP/1.0

    $objWriter = PHPExcel_IOFactory::createWriter($objPHPExcel, $outputType);
    $objWriter->save('php://output');

How resolve my problem in Render/HTML.php replase $loaded = $dom->loadHTML($this->securityScanFile($pFilename)); with $loaded = $dom->loadHTML(mb_convert_encoding($this->securityScanFile($pFilename), 'HTML-ENTITIES', 'UTF-8')); and all work correct.