dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets
http://ipyrad.readthedocs.io
GNU General Public License v3.0

reads not clustering/mapping to reference genome #406

Closed joshuahallas closed 4 years ago

joshuahallas commented 4 years ago

Hi,

I am trying to filter ddrad reads to a reference genome, but it doesn't seem to be working properly. The ipyrad run finishes as it should (much quicker than expected) but there are large amounts of missing data (~40%). I have run this data before and I am getting a completely different answer even with the same parameters.

The stats file shows that nothing is being removed at the total_prefiltered_loci, filtered_by_rm_duplicates, and filtered_by_max_indels steps (see below). I changed the maximum number of indels to 1 to see if I could filter out loci, but none were removed. I have started a denovo analysis with the same data/parameters, but it hasn't finished yet. I am currently running ipyrad v0.9.54 with python v3.6.7. I have tried older versions of the program and always get the same result. I have also tried a new reference fna file. I have looked through the output file and there is no error. It appears that everything runs as it should.

The only difference is that I am now using an HPC, with 32 cores. I did encounter an error when I tried to use --MPI: ipcluster couldn't be used. I figured this was because I was using a single node.

I apologize for the long post, and appreciate the help.

-josh

                            total_filters  applied_order  retained_loci
total_prefiltered_loci                  0              0         756599
filtered_by_rm_duplicates               0              0         756599
filtered_by_max_indels                  0              0         756599
filtered_by_max_SNPs                12572          12572         744027
filtered_by_max_shared_het           1197            763         743264
filtered_by_min_sample             641063         641063         102201
total_filtered_loci                654832         654398         102201
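For anyone reading the table above, here is a minimal Python sketch that checks its internal arithmetic. It assumes, as one reading of the column names and not something confirmed in this thread, that `applied_order` counts loci removed when the filters are applied in sequence and that `retained_loci` is the running total left after each step. The function name `check_filter_table` is hypothetical, and the numbers are copied from the table.

```python
# Rows copied from the stats table above:
# (filter name, total_filters, applied_order, retained_loci)
rows = [
    ("total_prefiltered_loci", 0, 0, 756599),
    ("filtered_by_rm_duplicates", 0, 0, 756599),
    ("filtered_by_max_indels", 0, 0, 756599),
    ("filtered_by_max_SNPs", 12572, 12572, 744027),
    ("filtered_by_max_shared_het", 1197, 763, 743264),
    ("filtered_by_min_sample", 641063, 641063, 102201),
    ("total_filtered_loci", 654832, 654398, 102201),
]


def check_filter_table(rows):
    """Return a list of inconsistencies (empty if the table adds up).

    Assumption: each applied_order value is subtracted from the running
    count of retained loci, and the last row holds the column totals.
    """
    first, *filters, total = rows
    retained = first[3]  # loci present before any filter is applied
    problems = []
    for name, _flagged, applied, reported in filters:
        retained -= applied
        if retained != reported:
            problems.append(f"{name}: expected {retained}, table says {reported}")
    sum_flagged = sum(r[1] for r in filters)
    sum_applied = sum(r[2] for r in filters)
    if (sum_flagged, sum_applied, retained) != tuple(total[1:]):
        problems.append("total_filtered_loci row does not match column sums")
    return problems


problems = check_filter_table(rows)
# Share of the prefiltered loci that the min_sample filter removes.
min_sample_fraction = 641063 / 756599
print(problems or "table is internally consistent")
print(f"min_sample removes {min_sample_fraction:.1%} of prefiltered loci")
```

Under that reading the rows are consistent with one another, and nearly all of the loss comes from the min_sample filter rather than from the duplicate or indel filters.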

isaacovercast commented 4 years ago

Hi Josh, I don't understand the problem here. RADSeq datasets are characterized by missingness, and 40% missing data is actually quite a low value, so you should be pleased. When you say "I have run this data before and am getting a completely different answer", can you be much more specific? What version did you run this data with before? What "answer" was different, exactly? If there are no duplicates and no indels in the final data then there will be nothing to filter, so I don't see this as a problem either.

Finally, the functionality of ipyrad won't change if it's run on an HPC versus run on your laptop. MPI is an advanced feature, and you shouldn't need it unless you have a very large dataset. MPI should actually work much more reliably on a single node.

Since this isn't actually an ipyrad bug, I'm going to close this issue. If you want to chat about the problems you're having, you can jump on our gitter channel: https://gitter.im/dereneaton/ipyrad