carnegie / PlantClusterFinder

GNU General Public License v3.0

non-existent field 'Col2'. #5

Closed Frybank closed 8 months ago

Frybank commented 8 months ago

Dear development team, when I was running the software, I was prompted with an error:

Check files for reading and writing
Reference to non-existent field 'Col2'.
Error in PlantClusterFinder>f_read_in_ConversionFile (line 2821)
Error in PlantClusterFinder (line 806)

I set up the environment according to the software instructions, ran the test data, and annotated enzymes with E2P2 from the test data's protein sequences; all of these steps ran successfully. [Screenshots: error message, glof, gtpf, pep fasta]

I suspect the cause is that the TranscriptNAME and ProteinNAME in my gtpf file are identical. I made the gtpf and glof files according to the corresponding format, but every run failed.

I sincerely hope to hear from you.

bxuecarnegie commented 8 months ago

The error raised by "f_read_in_ConversionFile" is related to the gene-transcript-protein mapping file. The function cannot find a second column, meaning that it failed to split the lines on tabs. Your file does seem to have 3 columns, but can you check that they are tab-delimited rather than separated by other whitespace?
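A quick way to test for this outside PlantClusterFinder is to split each line on tabs and report the lines that do not yield the expected number of fields. The sketch below is only an illustration: the function name and the sample rows are hypothetical, and the expected count of 3 is taken from the gene, transcript, protein layout mentioned above.

```python
def find_untabbed_lines(lines, expected_fields=3):
    """Return (line_number, field_count) for lines that do not split
    into `expected_fields` columns on tab characters."""
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        if not text:
            continue  # blank lines are skipped here, not reported
        count = len(text.split("\t"))
        if count != expected_fields:
            problems.append((number, count))
    return problems


# Hypothetical sample rows: the second one uses spaces instead of tabs.
sample = [
    "gene1\ttranscript1\tprotein1\n",
    "gene2 transcript2 protein2\n",
]
print(find_untabbed_lines(sample))
```

For a real file, the same function can be given an open file handle, for example `find_untabbed_lines(open("gtpfx.txt"))`.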

Frybank commented 8 months ago

Thank you for your reply. I have fixed that, but there is a new problem. Error message:

Error using PlantClusterFinder>f_remove_protein_transcript_info_from_genepositionfile (line 3036)
Gene / Transcript / Protein location file (Gene position file) has an entry gene.Vf0100001 55585 62629 Chr1 1 that is not covered by your gene ID conversion file. Please add the entry.

I don't understand why this happens. Does it indicate a problem with the glof file? Does the glof format also require tab separation, or is it something else?

bxuecarnegie commented 8 months ago

Yes, all files should be tab-delimited.

Frybank commented 8 months ago

I have changed the format, but when the software runs it reports that an index is out of bounds. The error message is as follows:

In PlantClusterFinder (line 976)
Index exceeds matrix dimensions.
Error in PlantClusterFinder>f_extract_results_with_header (line 4398)
Error in PlantClusterFinder (line 976)

I sincerely hope to hear from you.

bxuecarnegie commented 8 months ago

The line that raises the error is trying to read your gene location file (glof), so there might be formatting issues in the file. Check whether every line has the same number of columns (including the header).
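One way to run this check is to tally how many tab-separated columns each line has and note where each distinct count first appears. This is a minimal sketch, not part of PlantClusterFinder: the function name, the header names, and the second gene row are hypothetical, while the first data row reuses the entry quoted earlier in the thread.

```python
from collections import Counter


def column_count_report(lines):
    """Tally tab-separated column counts per line and remember the first
    line number on which each count occurs (header and blanks included)."""
    counts = Counter()
    first_seen = {}
    for number, line in enumerate(lines, start=1):
        n_columns = len(line.rstrip("\r\n").split("\t"))
        counts[n_columns] += 1
        first_seen.setdefault(n_columns, number)
    return counts, first_seen


# Hypothetical glof-like sample: a header, one good row, one short row,
# and a trailing blank line (which counts as a single empty column).
sample = [
    "GeneID\tStart\tEnd\tChr\tStrand\n",
    "gene.Vf0100001\t55585\t62629\tChr1\t1\n",
    "gene.Vf0100002\t70000\t71000\tChr1\n",
    "\n",
]
counts, first_seen = column_count_report(sample)
print(dict(counts), first_seen)
```

A well-formed file should produce a single column count; any additional count points to the first offending line number.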

Frybank commented 8 months ago

Dear author, following your tips I modified my data format. It now runs, but it still reports an error:

Too many input arguments.
Error in PlantClusterFinder>f_get_Sequencing_Gaps (line 2168)
Error in PlantClusterFinder>f_annotate_Sequencing_Gaps (line 1984)
Error in PlantClusterFinder (line 1124)
MATLAB:TooManyInputs

This message seems to be related to setting the parameters SeqGapSizesChromBreak '[10000]' PGDBIdsToMap GTP. I sincerely look forward to your reply.

Frybank commented 8 months ago

In fact, I have now run the program successfully, but the Cluster output file is empty. Is that because I set the parameters SeqGapSizesChromBreak '[10000]' PGDBIdsToMap GTP as in the example data? Can I leave these parameters unset and keep the defaults? Will that find as many clusters as possible?

bxuecarnegie commented 8 months ago

Since I'm not sure what your input is, you can of course use the default value.

Frybank commented 8 months ago

Is the data in the Inputs folder used by the default sample command specific to the sample data, or is it common to all runs? For example, the sample command uses -rmdf ReactionMetabolicDomainClassification.txt and -sitf scaffold-tailoring-reactions-05082016.tab. There are no values in the Cluster file in my output, and I wonder whether these data are the cause. Here is my output:

[Screenshots: WeChat image 20231108153328, WeChat image 20231108153333]

bxuecarnegie commented 8 months ago

The example command is used to run the pipeline on the csubellipsoidea data and generate the expected output. Your data might require different parameters.

Frybank commented 8 months ago

I'm sorry, maybe I wasn't clear enough. What I mean is the data in the Inputs folder that comes with the software, used in the sample command as -rmdf [PlantClusterFinder]\Inputs\ReactionMetabolicDomainClassification.txt.

Are these sample data for csubellipsoideacyc alone, or can they be used with other PGDBs? I annotated the proteins with E2P2 and generated a PGDB with Pathway Tools 23.0, then used it as the input PGDB for PlantClusterFinder. The result is not what I expected: the Cluster file has no data. Is the problem the -rmdf and -sitf input data, or do I have to set parameters? I ran the program with the default parameters.

bxuecarnegie commented 8 months ago

Yes, they should be used for your own input. Although they are out of date, it's unlikely they would result in 0 data.

bxuecarnegie commented 8 months ago

Apart from adjusting parameters for the pipeline, another thing you can do is compare the intermediate output (such as GAPOutput and the MCL clustering) with the example and see whether there are errors in them.

Frybank commented 8 months ago

Following your instructions, I did find something that looks unreasonable in the GAP_output file. What could cause this? I believe I followed the instructions. Does it mean something is wrong with our sequencing data?

[Screenshot: WeChat image 20231108201422] This is my GAP output.

[Screenshot: WeChat image 20231108201425] This is the test data.

bxuecarnegie commented 8 months ago

It is highly unlikely for every N sequence gap to be the same length. You should check your sequence file to see whether there really are 500 “N”s in every occurrence.
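The gap lengths can be checked directly in the genome FASTA by joining each record's sequence lines and measuring every run of N. The sketch below is independent of how PlantClusterFinder builds its GAPOutput file; the helper names and the tiny sample records are hypothetical.

```python
import re
from collections import Counter


def read_fasta(lines):
    """Parse FASTA lines into {record_id: sequence}, joining wrapped lines."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # ID is the first token of the header
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def gap_length_histogram(records):
    """Count how often each N-run length occurs across all records."""
    histogram = Counter()
    for sequence in records.values():
        for match in re.finditer(r"[Nn]+", sequence):
            histogram[len(match.group())] += 1
    return histogram


# Hypothetical sample: the Chr1 gap spans a line break (5 + 5 = 10 Ns).
sample = [">Chr1", "ACGTNNNNN", "NNNNNACGT", ">Chr2", "ACNNGT"]
print(gap_length_histogram(read_fasta(sample)))
```

If the histogram for a real genome shows a single length such as 500 for every gap, the gaps were probably inserted as fixed-size placeholders during assembly rather than measured.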

Frybank commented 8 months ago

May I ask whether the genes in glof and gtpf must be exactly the same? It seems that if a gene in glof does not appear among the gene IDs in gtpf, an error is reported: "that is not covered by your gene ID conversion file".

bxuecarnegie commented 8 months ago

They should.
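To find the offending entries before running the pipeline, the gene IDs of the two files can be compared as sets. This sketch assumes the gene ID is the first tab-separated column in both files and that each file has one header line; both are assumptions to adjust, and the function names, header names, and the second gene are hypothetical.

```python
def first_column_ids(lines, has_header=True):
    """Collect the first tab-separated field of every non-blank data line."""
    rows = list(lines)
    if has_header:
        rows = rows[1:]
    return {row.split("\t")[0].strip() for row in rows if row.strip()}


def uncovered_ids(glof_lines, gtpf_lines, glof_header=True, gtpf_header=True):
    """Return gene IDs present in the glof data but missing from the gtpf data."""
    glof_ids = first_column_ids(glof_lines, glof_header)
    gtpf_ids = first_column_ids(gtpf_lines, gtpf_header)
    return sorted(glof_ids - gtpf_ids)


# Hypothetical sample data; only gene.Vf0100001 comes from the thread.
glof = [
    "GeneID\tStart\tEnd\tChr\tStrand",
    "gene.Vf0100001\t55585\t62629\tChr1\t1",
    "gene.Vf0100002\t70000\t71000\tChr1\t1",
]
gtpf = [
    "GeneID\tTranscriptID\tProteinID",
    "gene.Vf0100002\ttranscript2\tprotein2",
]
print(uncovered_ids(glof, gtpf))
```

An empty result means every glof gene has a matching gtpf row; swapping the two arguments checks the opposite direction.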

Frybank commented 8 months ago

When I re-ran PlantClusterFinder with a new set of reliable data, the following error occurred:

In PlantClusterFinder (line 859)
Starting parallel pool (parpool) using the 'local' profile ... connected to 99 workers.
Error using parfor_endpoint_check (line 12)
The endpoint of a parfor range must be an integer. See Parallel Computing Toolbox, "parfor"
Error in PlantClusterFinder>f_get_Sequencing_Gaps (line 2143)
Error in PlantClusterFinder>f_annotate_Sequencing_Gaps (line 1984)
Error in PlantClusterFinder (line 1124)
MATLAB:parfor:range_endpoint

I look forward to your guidance.

bxuecarnegie commented 8 months ago

Have you set a separate '-para' value? Unless something changed on your system between runs, it's hard for me to know what caused this problem, as it doesn't look like it's caused by the script itself.

Frybank commented 8 months ago

Maybe the problem is my -para setting. I will change it and run again. Thank you very much for your guidance!

Frybank commented 8 months ago

Dear author, I used another set of data this time, and no errors were reported during the whole run, apart from a message suggesting that part of scaffo was not covered by the genome. However, the Cluster output file was still empty, as before. I checked GAPoutput against the test data and it seemed fine, but I found a big difference between the memex.txt output file and the reference data: there was no EC or RXN data in the output. This is clearly abnormal, and I am looking forward to your guidance.

[Screenshots: 1d338aa2a4e146be4a1761f64df947e, 57cbb43694e8f191d88535f7d8879ee]

bxuecarnegie commented 8 months ago

If neither EC nor RXN entries are pulled out of the PGDB flat files, it's likely that the Gene/Transcript/Protein names don't match the attributes found in those files. For example, your gene IDs should match entries in the genes.dat file ('UNIQUE-ID', 'PRODUCT', 'ACCESSION-1', 'ACCESSION-2') for the script to pull those annotations from the PGDB.
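A rough way to test this is to collect the values of those four attributes from genes.dat and list the IDs that match none of them. The sketch assumes the usual Pathway Tools attribute-value layout ("ATTRIBUTE - value" lines, "#" comments, "//" between records) and exact string matching; how PlantClusterFinder itself performs the match is not shown in this thread, and the function names and sample record are hypothetical.

```python
ATTRIBUTES = ("UNIQUE-ID", "PRODUCT", "ACCESSION-1", "ACCESSION-2")


def attribute_values(lines, attributes=ATTRIBUTES):
    """Collect the values of the selected attributes from flat-file lines."""
    values = set()
    for line in lines:
        if line.startswith("#"):
            continue  # comment line
        key, separator, value = line.partition(" - ")
        if separator and key.strip() in attributes:
            values.add(value.strip())
    return values


def unmatched_ids(ids, lines):
    """Return the IDs that equal none of the collected attribute values."""
    return sorted(set(ids) - attribute_values(lines))


# Hypothetical genes.dat excerpt with a single record.
genes_dat = [
    "# comment line",
    "UNIQUE-ID - G-1",
    "ACCESSION-1 - gene.Vf0100001",
    "PRODUCT - gene.Vf0100001-MONOMER",
    "//",
]
print(unmatched_ids(["gene.Vf0100001", "gene.Vf0100002"], genes_dat))
```

If most or all IDs come back unmatched, the names in the gtpf file and the PGDB were probably generated with different naming schemes.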

Frybank commented 8 months ago

Following your instructions, I rebuilt the PGDB and ran PCF, which gave the following error:

In PlantClusterFinder>f_get_metabolic_domains (line 2340)
In PlantClusterFinder (line 859)
Error using PlantClusterFinder>f_analyze_PlantClusterGapFile
Too many input arguments.
Error in PlantClusterFinder>f_get_Sequencing_Gaps (line 2168)
Error in PlantClusterFinder>f_annotate_Sequencing_Gaps (line 1984)
Error in PlantClusterFinder (line 1124)
MATLAB:TooManyInputs

Is it too much data? In theory that shouldn't be a problem, should it? Looking forward to your reply.

Frybank commented 8 months ago

It seems this problem has occurred before, but I used the default settings. I am very confused about why it has appeared again.

bxuecarnegie commented 8 months ago

"Too much data" shouldn't be a problem. Since that function reads in the GAPOutput file. Check if the file is correctly generated, and make sure there aren't any spaces in their paths.

Frybank commented 8 months ago

I wonder if there is a problem with the PGDB, because an error message appeared for orxn.pf when PathoLogic built the PGDB from the E2P2 output: "No variation of the gene sequence was found". However, I got the same message when I built the test data, and PCF still produced a final result consistent with the test results that came with the software. When I debugged this glof and gtpf earlier, the program ran to completion without errors but produced no output; after this modification, the run does not produce GAP_output data. This is my PGDB, built with Pathway Tools 23.0. Looking forward to your reply.

https://github.com/Frybank/PGDB/releases/download/v1.0.0/data.rar

bxuecarnegie commented 8 months ago

The GAP_output file doesn't require the PGDB; it is generated from your FASTA file.

Frybank commented 8 months ago

My FASTA file did produce the GAP_output file, but the software still reported an error:

In PlantClusterFinder (line 859)
Error using PlantClusterFinder>f_analyze_PlantClusterGapFile
Too many input arguments.
Error in PlantClusterFinder>f_get_Sequencing_Gaps (line 2168)
Error in PlantClusterFinder>f_annotate_Sequencing_Gaps (line 1984)
Error in PlantClusterFinder (line 1124)
MATLAB:TooManyInputs

I ran the following command, following the README instructions:

./run_PlantClusterFinder.sh /usr/local/MATLAB/MATLAB_Runtime/v91 \
  -pgdb "/home/wyh/software/PlantClusterFinder/lcucyc/1.0/data" \
  -rmdf "/home/wyh/software/PlantClusterFinder/Inputs/ReactionMetabolicDomainClassification.txt" \
  -md "{'Amines and Polyamines Metabolism'; 'Amino Acids Metabolism'; 'Carbohydrates Metabolism'; 'Cofactors Metabolism'; 'Detoxification Metabolism'; 'Energy Metabolism'; 'Fatty Acids and Lipids Metabolism'; 'Hormones Metabolism'; 'Inorganic Nutrients Metabolism'; 'Nitrogen-Containing Compounds'; 'Nucleotides Metabolism'; 'Phenylpropanoid Derivatives'; 'Polyketides'; 'Primary-Specialized Interface Metabolism'; 'Redox Metabolism'; 'Specialized Metabolism'; 'Sugar Derivatives'; 'Terpenoids'}" \
  -psf "/home/wyh/software/PlantClusterFinder/Lcuevm.pep.fasta" \
  -gtpf "/home/wyh/software/PlantClusterFinder/gtpfx.txt" \
  -glof "/home/wyh/software/PlantClusterFinder/glofx.txt" \
  -dnaf "/home/wyh/software/PlantClusterFinder/c.genome.fasta" \
  -sitf "/home/wyh/software/PlantClusterFinder/Inputs/scaffold-tailoring-reactions-05082016.tab" \
  -gout "/home/wyh/software/PlantClusterFinder/Lcu1_3_memex.txt" \
  -cout "/home/wyh/software/PlantClusterFinder/LcuClust1_3_memex.txt"

I filtered the gtpf and glof files so that they contain the same number of entries, and both are tab-separated. The genome and protein data were published in NC, so in theory they should not be the problem.
https://github.com/Frybank/PGDB/blob/main/glofx.txt
https://github.com/Frybank/PGDB/blob/main/gtpfx.txt

bxuecarnegie commented 8 months ago

From the files alone I cannot discern any problems. Have you tried putting back the SeqGapSizesChromBreak argument? Also, do the headers of your sequence files contain only their IDs?
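The header question can be answered by listing every FASTA header that contains whitespace, since anything after the first space is a description rather than part of the ID. A minimal sketch with a hypothetical function name and sample records:

```python
def headers_with_extra_text(lines):
    """Return FASTA headers (without '>') that are empty or contain whitespace,
    i.e. headers that carry more than a bare ID."""
    flagged = []
    for line in lines:
        if line.startswith(">"):
            header = line[1:].rstrip("\r\n")
            if not header or any(char.isspace() for char in header):
                flagged.append(header)
    return flagged


# Hypothetical sample: the second record carries a description after the ID.
sample = [">Chr1\n", "ACGT\n", ">Chr2 length=1000\n", "ACGT\n"]
print(headers_with_extra_text(sample))
```

The same function works on the protein FASTA; an empty list means every header is a bare ID.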

Frybank commented 8 months ago

Ok, maybe I should try to run it with SeqGapSizesChromBreak '[10000]' PGDBIdsToMap GTP

Frybank commented 8 months ago

I set the SeqGapSizesChromBreak'[10000]' PGDBIdsToMap GTP parameter, but I got this error:

Error using fgets
Invalid file identifier. Use fopen to generate a valid file identifier.
Error in fgetl (line 33)
Error in PlantClusterFinder>f_annotate_Sequencing_Gaps (line 1990)
Error in PlantClusterFinder (line 1124)
MATLAB:FileIO:InvalidFid

Is this caused by a problem in my gtpf file? Can the second and third columns only be numbers and nothing else?

[Screenshot: 2f6d8759474d4dc968ba9084b57ca14]

bxuecarnegie commented 8 months ago

No, they can be characters. The problem here is that the script cannot find a file that it needs. It's possible that a file path you provided is incorrect or, if the argument you pasted is exactly what you ran, that a space is missing after 'Break'.
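Path problems of this kind, including the spaces in paths mentioned earlier in the thread, can be ruled out before starting a run. The sketch below checks each path for whitespace and for existence; the function name and the demo paths are hypothetical, and the option labels are used only as dictionary keys.

```python
import os


def path_problems(paths):
    """Map each option label to a list of issues found with its path.
    Labels whose paths look fine are left out of the result."""
    problems = {}
    for label, path in paths.items():
        issues = []
        if any(char.isspace() for char in path):
            issues.append("contains whitespace")
        if not os.path.exists(path):
            issues.append("does not exist")
        if issues:
            problems[label] = issues
    return problems


# Hypothetical demo paths: "." always exists, the second path should not.
demo = {"-dnaf": ".", "-glof": "no such dir/glofx.txt"}
print(path_problems(demo))
```

In practice the dictionary would hold the real values passed to -pgdb, -rmdf, -psf, -gtpf, -glof, -dnaf and -sitf; output paths such as -gout and -cout will not exist yet, so only their directories are worth checking.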

Frybank commented 8 months ago

This is my script:

./run_PlantClusterFinder.sh /usr/local/MATLAB/MATLAB_Runtime/v91 \
  -pgdb "/home/wyh/software/PlantClusterFinder/lcucyc/1.0/data" \
  -rmdf "/home/wyh/software/PlantClusterFinder/Inputs/ReactionMetabolicDomainClassification.txt" \
  -md "{'Amines and Polyamines Metabolism'; 'Amino Acids Metabolism'; 'Carbohydrates Metabolism'; 'Cofactors Metabolism'; 'Detoxification Metabolism'; 'Energy Metabolism'; 'Fatty Acids and Lipids Metabolism'; 'Hormones Metabolism'; 'Inorganic Nutrients Metabolism'; 'Nitrogen-Containing Compounds'; 'Nucleotides Metabolism'; 'Phenylpropanoid Derivatives'; 'Polyketides'; 'Primary-Specialized Interface Metabolism'; 'Redox Metabolism'; 'Specialized Metabolism'; 'Sugar Derivatives'; 'Terpenoids'}" \
  -psf "/home/wyh/software/PlantClusterFinder/Lcuevm.pep.fasta" \
  -gtpf "/home/wyh/software/PlantClusterFinder/gtpfx.txt" \
  -glof "/home/wyh/software/PlantClusterFinder/glofx.txt" \
  -dnaf "/home/wyh/software/PlantClusterFinder/c.genome.fasta" \
  -sitf "/home/wyh/software/PlantClusterFinder/Inputs/scaffold-tailoring-reactions-05082016.tab" \
  -gout "/home/wyh/software/PlantClusterFinder/LcuGene.txt" \
  -cout "/home/wyh/software/PlantClusterFinder/LcuClust.txt" \
  SeqGapSizesChromBreak '[10000]' PGDBIdsToMap GTP

There is a space after "Break".

Frybank commented 8 months ago

ok, maybe I should try again

Frybank commented 8 months ago

I followed your suggestion and added the space in SeqGapSizesChromBreak '[10000]' PGDBIdsToMap GTP. The problem remains:

Error using fgets
Invalid file identifier. Use fopen to generate a valid file identifier.
Error in fgetl (line 33)
Error in PlantClusterFinder>f_annotate_Sequencing_Gaps (line 1990)
Error in PlantClusterFinder (line 1124)
MATLAB:FileIO:InvalidFid

bxuecarnegie commented 8 months ago

I’ve never encountered such errors. Can you first rerun the example data to confirm it still works on your machine? Next, add the argument “Verbose 1” to your own command so we have more information about your run.

Frybank commented 8 months ago

I have run the test data and it still runs without problems. As for the Verbose 1 parameter you asked me to add to my command, I don't understand how to add it.

Error using fgets
Invalid file identifier. Use fopen to generate a valid file identifier.
Error in fgetl (line 33)

This error message mentions fgets. I looked at other issues, where the author said it may be a permission problem, so I granted permissions with chmod 777 and tried again. If I leave SeqGapSizesChromBreak '[10000]' PGDBIdsToMap GTP out of the command, the program runs to completion without errors, but once I add it, this error appears. Could SeqGapSizesChromBreak '[10000]' PGDBIdsToMap GTP be the problem?

bxuecarnegie commented 8 months ago

Without those arguments, does the script produce the expected results for the example?

Frybank commented 8 months ago

I tried, but without SeqGapSizesChromBreak '[10000]' PGDBIdsToMap GTP it cannot output Cluster data. The reported error is:

Error using fgets
Invalid file identifier. Use fopen to generate a valid file identifier.
Error in fgetl (line 33)

What do you mean by adding Verbose to my command? I don't understand.

bxuecarnegie commented 8 months ago

The end of your command would be "PGDBIdsToMap GTP Verbose 1". I'm going to ask you again: did you get the expected results when running the example?

Frybank commented 8 months ago

I followed the example command, including SeqGapSizesChromBreak '[10000]' PGDBIdsToMap GTP, and got the expected result. You are right, this parameter is required.

Frybank commented 8 months ago

The error message is as follows:

Calculate size of sequence gap that should be populated by hypothetical genes
Identify sequencing gaps (bases encoded by N)
Error using fgets
Invalid file identifier. Use fopen to generate a valid file identifier.
Error in fgetl (line 33)
Error in PlantClusterFinder>f_annotate_Sequencing_Gaps (line 1990)
Error in PlantClusterFinder (line 1124)
MATLAB:FileIO:InvalidFid

bxuecarnegie commented 8 months ago

The file-related values in f_annotate_Sequencing_Gaps are the masked DNA file and the "_GAPOutput" file generated from it. From the code, the error is likely related to the GAPOutput file, but according to your reply that file is generated. To pinpoint the error more easily, can you do a clean run this time? Remove all the temporary files generated during the previous runs, such as the temp files and GAPOutput, and use "Verbose 2" so we have more information. Then check whether the GAPOutput file is generated this time and whether its content is correct. For example, does the file start with empty lines? Does it have the same 4-column structure as "CsubellipsoideaC_169_227_v2.0.hardmasked.fa_GAPOutput"?

Frybank commented 8 months ago

After adding the "Verbose 2" parameter, the output below was displayed. Following your instructions, I checked the GAP_output file, and it was generated in the same form as the example file. I noticed that the number of "N" in the temp file of my genome is very small, only 87, whereas the number in the test data is very large, as high as 19,000. I wonder if there is a problem here? Looking forward to your reply very much.

Next command running: /home/wyh/miniconda3/bin/awk -f enter_new_line_characters_in_fastafile.awk '/home/wyh/software/PlantClusterFinder/c.genome.fasta'>'/home/wyh/software/PlantClusterFinder/c.genome.fasta_temp1'
Status was 0
Output was:
Next command running: ' < '/home/wyh/software/PlantClusterFinder/c.genome.fasta_temp1'>'/home/wyh/software/PlantClusterFinder/c.genome.fasta_temp2'
Status was 0
Output was:
Next command running: tr -d ' ' < '/home/wyh/software/PlantClusterFinder/c.genome.fasta_temp2'>'/home/wyh/software/PlantClusterFinder/c.genome.fasta_temp3'
Status was 0
Output was:
Next command running: /home/wyh/miniconda3/bin/awk -f get_new_linein.awk '/home/wyh/software/PlantClusterFinder/c.genome.fasta_temp3'>'/home/wyh/software/PlantClusterFinder/c.genome.fasta_temp4'
Status was 0
Output was:
Next command running: /home/wyh/miniconda3/bin/awk -f get_positions_ofgap.awk '/home/wyh/software/PlantClusterFinder/c.genome.fasta_temp4'>'/home/wyh/software/PlantClusterFinder/c.genome.fasta_GAPOutput'
Status was 0
Output was:
Error using PlantClusterFinder>f_analyze_PlantClusterGapFile
Too many input arguments.
Error in PlantClusterFinder>f_get_Sequencing_Gaps (line 2168)
Error in PlantClusterFinder>f_annotate_Sequencing_Gaps (line 1984)
Error in PlantClusterFinder (line 1124)
MATLAB:TooManyInputs

[Screenshot: aec3bf6734e1b5f221d3fb66178e155]

Frybank commented 8 months ago

I counted the number of N entries in my GAP_output: it is more than 3000, and the test data has more than 4000. In theory, is this number proportional to the number of N in the temp file?