dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets
http://ipyrad.readthedocs.io
GNU General Public License v3.0

*.ustr file and other files not generated when using a genome as a reference in the assembly method #542

Open aroavaron opened 8 months ago

aroavaron commented 8 months ago

Hello,

I ran the program using the denovo assembly strategy and then using a genome as a reference. For the first one, ipyrad generated 18 output files. However, when I used the genome, it only generated 14 output files. The missing files are: .migrate, .treemix, .ugeno, and, most importantly for me, .ustr. Has anyone encountered the same issue? Any suggestions or insights would be appreciated.

Cheers, A

isaacovercast commented 8 months ago

Did you check the output_formats param in the params file? The default value will not generate all output files. Can you please set the output_formats equal to * (which indicates to create all output formats), and run step 7 again for the reference assembly (with the -f flag to generate new outputs). Please let me know if that works.
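As a rough illustration of the suggestion above, the sketch below rewrites the output_formats line of a params file so that its value is *. It assumes the usual params layout of "value ## [index] [name]: description"; the helper name set_param and the example text are hypothetical and not part of ipyrad.

```python
def set_param(params_text: str, name: str, value: str) -> str:
    """Return params_text with the value of parameter `name` replaced.

    Assumes each parameter line looks like:
        value    ## [27] [output_formats]: Output formats (see docs)
    """
    out = []
    found = False
    for line in params_text.splitlines():
        _, sep, comment = line.partition("##")
        # Only touch the line whose comment names the requested parameter.
        if sep and f"[{name}]" in comment:
            line = f"{value:<30} ##{comment}"
            found = True
        out.append(line)
    if not found:
        raise KeyError(f"parameter {name!r} not found")
    return "\n".join(out) + "\n"


# Hypothetical two-line excerpt of a params file.
example = (
    "ref_run                        ## [0] [assembly_name]: Assembly name\n"
    "p, s, l                        ## [27] [output_formats]: Output formats (see docs)\n"
)
updated = set_param(example, "output_formats", "*")
print(updated)
```

After saving the edited file, step 7 would be rerun with the force flag, for example ipyrad -p params-ref_run.txt -s 7 -f (the params file name here is made up).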

aroavaron commented 8 months ago

Hi Isaac, I should have mentioned that I used the "*" in the params to create all the output formats, but the run only generated 8 instead of 12 files. Thanks for your quick reply!

isaacovercast commented 8 months ago

What version of ipyrad are you running? (ipyrad -v will print the version) Using the most recent version (0.9.93) I ran a reference assembly from scratch on the simulated data and using * as the output_formats value gave me the full complement of output files.

If it is the most recent version of ipyrad please share your .json file with me that is in the project dir.

aroavaron commented 8 months ago

I ran ipyrad on the server, where it's listed as ipyrad/0.9. I also thought that it could be an issue with the version, so I installed the latest version (ipyrad_0.9.93) in my miniconda environment. However, I'm currently encountering a different issue and troubleshooting it. The program starts, but it gets stuck at the first step. No files are generated, so I can't share the .json file yet!


ipyrad [v.0.9.93] Interactive assembly and analysis of RAD-seq data


Parallel connection | compute-65-17: 20 cores

Step 1: Loading sorted fastq data to Samples

aroavaron commented 8 months ago

Quick update. It was not running because of a lack of memory!

I'm running ipyrad for two species. For the first one, ipyrad finished successfully and generated all the outputs. I used denovo assembly with the genome reference as a filter (parameter #29). Compared with my previous run using ipyrad 0.9.12 and the genome as a reference, the number of retained loci dropped from 22K to 6K, and the amount of missing data for both the SNPs matrix and the sequence matrix decreased from 20.5% to 10.6%. Could it be possible that the newer version has different criteria?

Regarding the second species, ipyrad has not been able to pass step #6. Attached is the json file. I hope it helps to find out what could be the error. Fingers crossed for a quick and easy solution!

Thanks, A

vsref_fil_200k_85p_denovo.json

isaacovercast commented 8 months ago

@aroavaron In general the newest version of ipyrad should be trusted more than any previous version, because we are always fixing bugs. The difference in results between 0.9.93 and 0.9.12 (very old) is not so surprising. I would trust the newest version.

As for the second species, can you tell me what is the error you are getting during step 6? If you can show me all the command line output and the full error message when it dies that would be very helpful.

aroavaron commented 3 months ago

I ran the latest version of ipyrad (0.9.93) and used two assembly approaches. The reference approach resulted in 15,304 loci retained (26.3% SNPs matrix missing sites / 28.1% sequence matrix missing sites), while the denovo-reference approach, using the reference (in this case a genome) as a filter (parameter #29), recovered 6,378 loci (14.6% SNPs matrix missing sites / 14.8% sequence matrix missing sites). For downstream analyses, it would generally be better to use the data with fewer missing values. However, I am curious about the reason(s) for the difference in the number of retained loci.

isaacovercast commented 3 months ago

I'm not sure I understand what the two different assemblies were. In one case you did the 'reference' assembly using an 'on target' genome. In the 'denovo-reference' approach, did you use this same genome sequence as the 'reference_as_filter' parameter? In general, different assembly methods do quite different things, so they will normally produce different results.

aroavaron commented 3 months ago

Yes, exactly! I used the same genome (at chromosome level, from the species that I'm working on) for both approaches. I fully agree with you that different approaches would generate different results, but I would like to understand a little better what is going on, as I was not expecting the second approach to retain only about 42% as many loci (a drop of roughly 58%). Thank you for the quick reply!
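For reference, the loci counts quoted earlier in this thread (15,304 for the reference assembly and 6,378 for the assembly that used the genome as a filter) can be compared with a quick calculation. This is only arithmetic on the reported numbers; the variable names are illustrative.

```python
# Loci counts as reported in this thread.
reference_loci = 15304   # 'reference' assembly
filtered_loci = 6378     # assembly with the genome used as a filter

# Fraction of the reference-assembly locus count that the second run retained.
retained_fraction = filtered_loci / reference_loci
# Relative drop in the number of loci between the two runs.
drop_fraction = 1 - retained_fraction

print(f"retained: {retained_fraction:.1%}")  # retained: 41.7%
print(f"drop:     {drop_fraction:.1%}")      # drop:     58.3%
```

So the second assembly kept about 42% as many loci as the first, which corresponds to a drop of about 58%.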

isaacovercast commented 3 months ago

Well, the reference_as_filter removes any reads that map to the reference sequence, so the 6,378 loci you retained in this assembly are all the loci that don't map well to the reference (for whatever reason). Either they are off target, or the reference is distant from the focal taxon, or the assembly quality is not perfect. Does that help?
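To make the filtering idea concrete, here is a deliberately simplified toy sketch. It is not how ipyrad is implemented: a real read mapper tolerates mismatches and gaps, while this example treats a read as "mapping" only if it is an exact substring of the reference. All sequences and names are invented for illustration.

```python
# Toy reference sequence (hypothetical).
reference = "ACGTACGTTAGCCGATAGGCTTAACG"

# Toy reads (hypothetical). Two occur exactly in the reference, one does not.
reads = {
    "read1": "TAGCCGATAG",  # found in the reference -> treated as mapped
    "read2": "GGGGGCCCCC",  # not found -> treated as unmapped
    "read3": "CGTTAGCC",    # found in the reference -> treated as mapped
}


def filter_against_reference(reads: dict, reference: str) -> dict:
    """Keep only reads that do NOT 'map' (exact substring match) to the reference."""
    return {name: seq for name, seq in reads.items() if seq not in reference}


kept = filter_against_reference(reads, reference)
print(sorted(kept))  # ['read2']
```

In this toy model only the reads that fail to match the reference survive the filter, which mirrors the point made above: loci retained after filtering are those that did not map to the filter sequence.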